Most active commenters

Thorrez(7)
chefandy(4)

Popular/hot comments

>>42146244 #
>>42146749 #
>>42146884 #
>>42147570 #
>>42146560 #
>>42146959 #
>>42147314 #

←back to thread

Something weird is happening with LLMs and chess

(dynomight.substack.com)

1. codeflo ◴[15 Nov 24 10:52 UTC] No.42145710[source]▶

>>42138289 (OP) #

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

replies(3): >>42146244 #>>42146391 #>>42151266 #

2. darkerside ◴[15 Nov 24 12:21 UTC] No.42146244[source]▶

>>42145710 (TP) #

VW got in a lot of trouble for this

replies(10): >>42146543 #>>42146550 #>>42146553 #>>42146556 #>>42146560 #>>42147093 #>>42147124 #>>42147353 #>>42147357 #>>42148300 #

3. ◴[15 Nov 24 12:44 UTC] No.42146391[source]▶

>>42145710 (TP) #

4. TrueDuality ◴[15 Nov 24 13:06 UTC] No.42146543[source]▶

>>42146244 #

Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).

replies(1): >>42146761 #

5. close04 ◴[15 Nov 24 13:07 UTC] No.42146550[source]▶

>>42146244 #

Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.

Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.

Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.

6. Swenrekcah ◴[15 Nov 24 13:08 UTC] No.42146553[source]▶

>>42146244 #

Actually performing well on a task that is used as a benchmark is not comparable to decieving authorities about how much toxic gas you are releasing.

7. ArnoVW ◴[15 Nov 24 13:08 UTC] No.42146556[source]▶

>>42146244 #

True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.

It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.

replies(1): >>42148055 #

8. sigmoid10 ◴[15 Nov 24 13:08 UTC] No.42146560[source]▶

>>42146244 #

Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.

replies(3): >>42146749 #>>42147885 #>>42150309 #

9. 0xFF0123 ◴[15 Nov 24 13:32 UTC] No.42146749{3}[source]▶

>>42146560 #

The only difference is the legality. From an integrity point of view it's basically the same

replies(7): >>42146884 #>>42146984 #>>42147072 #>>42147078 #>>42147443 #>>42147742 #>>42147978 #

10. krisoft ◴[15 Nov 24 13:33 UTC] No.42146761{3}[source]▶

>>42146543 #

> VW got in trouble for running _different_ software in test vs prod.

Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.

The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.

replies(1): >>42147479 #

11. Thorrez ◴[15 Nov 24 13:47 UTC] No.42146884{4}[source]▶

>>42146749 #

I think breaking a law is more unethical than not breaking a law.

Also, legality isn't the only difference in the VW case. With VW, they had a "good emissions" mode. They enabled the good emissions mode during the test, but disabled it during regular driving. It would have worked during regular driving, but they disabled it during regular driving. With compilers, there's no "good performance" mode that would work during regular usage that they're disabling during regular usage.

replies(4): >>42146959 #>>42147070 #>>42147439 #>>42147666 #

12. Lalabadie ◴[15 Nov 24 13:55 UTC] No.42146959{5}[source]▶

>>42146884 #

> I think breaking a law is more unethical than not breaking a law.

It sounds like a mismatch of definition, but I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical. Law is the codification and enforcement of a social contract, not the creation of it.

replies(3): >>42147314 #>>42147369 #>>42148090 #

13. UniverseHacker ◴[15 Nov 24 13:58 UTC] No.42146984{4}[source]▶

>>42146749 #

I disagree- presumably if an algorithm or hardware is optimized for a certain class of problem it really is good at it and always will be- which is still useful if you are actually using it for that. It’s just “studying for the test”- something I would expect to happen even if it is a bit misleading.

VW cheated such that the low emissions were only active during the test- it’s not that it was optimized for low emissions under the conditions they test for, but that you could not get those low emissions under any conditions in the real world. That's "cheating on the test" not "studying for the test."

14. Winse ◴[15 Nov 24 14:08 UTC] No.42147070{5}[source]▶

>>42146884 #

unless following an unethical law would in itself be unethical, then breaking the unethical law would be the only ethical choice. In this case cheating emissions, which I see as unethical, but also advantageous for the consumer, should have been done openly if VW saw following the law as unethical. Ethics and morality are subjective to understanding, and law only a crude approximation of divinity. Though I would argue that each person on the earth through a shared common experience has a rough and general idea of right from wrong...though I'm not always certain they pay attention to it.

15. the_af ◴[15 Nov 24 14:09 UTC] No.42147072{4}[source]▶

>>42146749 #

> The only difference is the legality. From an integrity point of view it's basically the same

I think cheating about harming the environment is another important difference.

16. Swenrekcah ◴[15 Nov 24 14:10 UTC] No.42147078{4}[source]▶

>>42146749 #

That is not true. Even ChatGPT understands how they are different, I won’t paste the whole response but here are the differences it highlights:

Key differences:

1. Intent and harm: • VW’s actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn’t have immediate real-world harms. 2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It’s not a hidden “defeat device” but rather an explicit design goal. 3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren’t necessarily “broken” outside chess, whereas VW cars fundamentally failed to meet emissions standards.

17. tightbookkeeper ◴[15 Nov 24 14:11 UTC] No.42147093[source]▶

>>42146244 #

This is 10 year old story. It’s very interesting which ones stay in the public consciousness.

18. bluGill ◴[15 Nov 24 14:14 UTC] No.42147124[source]▶

>>42146244 #

Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.

These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.

19. Thorrez ◴[15 Nov 24 14:39 UTC] No.42147314{6}[source]▶

>>42146959 #

>I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical.

There are many cases where I think that. Examples:

* Underage drinking. If it's legal for someone to drink, I think it's in general ethical. If it's illegal, I think it's in general unethical.

* Tax avoidance strategies. If the IRS says a strategy is allowed, I think it's ethical. If the IRS says a strategy is not allowed, I think it's unethical.

* Right on red. If the government says right on red is allowed, I think it's ethical. If the government (e.g. NYC) says right on red is not allowed, I think it's unethical.

The VW case was emissions regulations. I think they have an ethical obligation to obey emissions regulations. In the absence of regulations, it's not an obvious ethical problem to prioritize fuel efficiency instead of emissions (that's I believe what VW was doing).

replies(3): >>42147570 #>>42148734 #>>42156023 #

20. newerman ◴[15 Nov 24 14:44 UTC] No.42147353[source]▶

>>42146244 #

Funny response; you're not wrong.

21. conradev ◴[15 Nov 24 14:44 UTC] No.42147357[source]▶

>>42146244 #

GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?

replies(1): >>42147748 #

22. emn13 ◴[15 Nov 24 14:45 UTC] No.42147369{6}[source]▶

>>42146959 #

Also, while laws ideally are inspired by an ethical social contract, the codification proces is long, complex and far from perfect. And then for rules concerning permissible behavior even in the best of cases, it's enforced extremely sparingly simply because it's not possible nor desirable to detect and deal with all infractions. Nor is it applied blindly and equally. As actually applied, a law is definitely not even close to some ethical ideal; sometimes it's outright opposed to it, even.

Law and ethics are barely related, in practice.

For example in the vehicle emissions context, it's worth noting that even well before VW was caught the actions of likely all carmakers affected by the regulations (not necessarily to the same extent) were clearly unethical. The rules had been subject to intense clearly unethical lobbying for years, and so even the legal lab results bore little resemblance to practical on-the-road results though systematic (yet legal) abuse. I wouldn't be surprised to learn that even what was measured intentionally diverged from what is harmfully in a profitable way. It's a good thing VW was made an example of - but clearly it's not like that resolved the general problem of harmful vehicle emissions. Optimistically, it might have signaled to the rest of the industry and VW in particular to stretch the rules less in the future.

23. hansworst ◴[15 Nov 24 14:54 UTC] No.42147439{5}[source]▶

>>42146884 #

Overfitting on test data absolutely does mean that the model would perform better in benchmarks than it would in real life use cases.

replies(1): >>42158947 #

24. currymj ◴[15 Nov 24 14:54 UTC] No.42147443{4}[source]▶

>>42146749 #

VW was breaking the law in a way that harmed society but arguably helped the individual driver of the VW car, who gets better performance yet still passes the emissions test.

replies(2): >>42147637 #>>42149872 #

25. TrueDuality ◴[15 Nov 24 14:58 UTC] No.42147479{4}[source]▶

>>42146761 #

I disagree with your distinction on the environments but understand your argument. Production for VM to me is "on the road when a customer is using your product as intended". Using the same artifact for those different environments isn't the same as "running that in production".

replies(1): >>42149146 #

26. chefandy ◴[15 Nov 24 15:08 UTC] No.42147570{7}[source]▶

>>42147314 #

Drinking and right turns are unethical if they’re negligent. They’re not unethical if they’re not negligent. The government is trying to reduce negligence by enacting preventative measures to stop ALL right turns and ALL drinking in certain contexts that are more likely to yield negligence, or where the negligence world be particularly harmful, but that doesn’t change whether or not the behavior itself is negligent.

You might consider disregarding the government’s preventative measures unethical, and doing those things might be the way someone disregards the governments protective guidelines, but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical.

To use a clearer example, the ethicality of abortion— regardless of what you think of it— is not changed by its legal status. You might consider violating the law unethical, so breaking abortion laws would constitute the same ethical violation as underage drinking, but those laws don’t change the ethics of abortion itself. People who consider it unethical still consider it unethical where it’s legal, and those that consider it ethical still consider it ethical where it’s not legal.

replies(4): >>42147856 #>>42148191 #>>42148730 #>>42157977 #

27. jimmaswell ◴[15 Nov 24 15:15 UTC] No.42147637{5}[source]▶

>>42147443 #

And afaik the emissions were still miles ahead of a car from 20 years prior, just not quite as extremely stringent as requested.

replies(1): >>42148188 #

28. Retr0id ◴[15 Nov 24 15:19 UTC] No.42147666{5}[source]▶

>>42146884 #

ethics should inform law, not the reverse

replies(1): >>42158917 #

29. boringg ◴[15 Nov 24 15:27 UTC] No.42147742{4}[source]▶

>>42146749 #

How so? VW intentionally changed the operation of the vehicle so that its emissions met the test requirements during the test and then went back to typical operation conditions afterwards.

30. GolfPopper ◴[15 Nov 24 15:27 UTC] No.42147748{3}[source]▶

>>42147357 #

I think the OP's point is that chat GPT-3.5 may have a chess-engine baked-in to its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess-engine.

replies(1): >>42147861 #

31. adgjlsfhk1 ◴[15 Nov 24 15:39 UTC] No.42147856{8}[source]▶

>>42147570 #

the right on red example is interesting because in that case, the law changes how other drivers and pedestrians will behave in ways that make it pretty much always unsafe

replies(1): >>42148048 #

32. conradev ◴[15 Nov 24 15:39 UTC] No.42147861{4}[source]▶

>>42147748 #

I see – my initial interpretation of OP’s “special case” was “Theory 2: GPT-3.5-instruct was trained on more chess games.”

But I guess it’s also a possibility that they had a real chess engine hiding in there.

33. rsynnott ◴[15 Nov 24 15:43 UTC] No.42147885{3}[source]▶

>>42146560 #

There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.

It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.

It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.

Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.

34. TimTheTinker ◴[15 Nov 24 15:52 UTC] No.42147978{4}[source]▶

>>42146749 #

Right - in either case it's lying, which is crossing a moral line (which is far more important to avoid than a legal line).

35. chefandy ◴[15 Nov 24 15:57 UTC] No.42148048{9}[source]▶

>>42147856 #

That just changes the parameters of negligence. On a country road in the middle of a bunch of farm land where you can see for miles, it doesn’t change a thing.

36. K0balt ◴[15 Nov 24 15:58 UTC] No.42148055{3}[source]▶

>>42146556 #

This is the apples to apples version. Perhaps might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds then picks the best answer, otherwise it just zero-shots the prompt in everyday use.

I say this because the be test still uses the same hardware (model) but changed the way it behaved by running emissions friendly parameters ( a different execution framework) that wouldn’t have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.

What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.

37. mbrock ◴[15 Nov 24 16:02 UTC] No.42148090{6}[source]▶

>>42146959 #

But following the law is itself a load bearing aspect of the social contract. Violating building codes, for example, might not cause immediate harm if it's competent but unusual, yet it's important that people follow it just because you don't want arbitrariness in matters of safety. The objective ruleset itself is a value beyond the rules themselves, if the rules are sensible and in accordance with deeper values, which of course they sometimes aren't, in which case we value civil disobedience and activism.

38. slowmotiony ◴[15 Nov 24 16:11 UTC] No.42148188{6}[source]▶

>>42147637 #

"not quite as extremely stringent as requested" is a funny way to say they were emitting 40 times more toxic fumes than permitted by law.

replies(1): >>42201013 #

39. mbrock ◴[15 Nov 24 16:11 UTC] No.42148191{8}[source]▶

>>42147570 #

It's not so simple. An analogy is the Rust formatter that has no options so everyone just uses the same style. It's minimally "unethical" to use idiosyncratic Rust style just because it goes against the convention so people will wonder why you're so special, etc.

If the rules themselves are bad and go against deeper morality, then it's a different situation; violating laws out of civil disobedience, emergent need, or with a principled stance is different from wanton, arbitrary, selfish cheating.

If a law is particularly unjust, violating the law might itself be virtuous. If the law is adequate and sensible, violating it is usually wrong even if the violating action could be legal in another sensible jurisdiction.

40. gdiamos ◴[15 Nov 24 16:20 UTC] No.42148300[source]▶

>>42146244 #

It’s approximately bad, like most of ML

On one side:

Would you expect a model trained on no Spanish data to do well on Spanish?

On the other:

Is it okay to train on the MMLU test set?

41. ClumsyPilot ◴[15 Nov 24 17:07 UTC] No.42148730{8}[source]▶

>>42147570 #

> but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical

That is, sometimes, sufficient.

If government says ‘seller of a house must disclose issues’ then I rely rely on the law being followed, if you sell and leave the country, you have defrauded me.

However if I live in a ‘buyer beware’ jurisdiction, then I know I cannot trust the seller and I hire a surveyor and take insurance.

There is a degree of setting expectations- if there is a rule, even if it’s a terrible rule, I as individual can at least take some countermeasures.

You can’t take countermeasures against all forms of illegal behaviour, because there is infinite number of them. And a truly insane person is unpredictable at all.

42. banannaise ◴[15 Nov 24 17:08 UTC] No.42148734{7}[source]▶

>>42147314 #

Outsourcing your morality to politicians past and present is not a particularly useful framework.

replies(2): >>42150043 #>>42158851 #

43. krisoft ◴[15 Nov 24 17:47 UTC] No.42149146{5}[source]▶

>>42147479 #

“Test” environment is the domain of prototype cars driving at the proving ground. It is an internal affair, only for employees and contractors. The software is compiled on some engineer’s laptop and uploaded on the ECU by an engineer manually. No two cars are ever the same, everything is in flux. The number of cars are small.

“Production” is a factory line producing cars. The software is uploaded on the ECUs by some factory machine automatically. Each car are exactly the same, with the exact same software version on thousands and thousands of cars. The cars are sold to customers.

Some small number of these prodiction cars are sent for regulatory compliance checks to third parties. But those cars won’t become suddenly non-production cars just because someone sticks up a probe in their exhausts. The same way gmail’s production servers don’t suddenly turn into test environments just because a user opens the network tab in their browser’s dev tool to see what kind of requests fly on the wire.

44. int_19h ◴[15 Nov 24 19:09 UTC] No.42149872{5}[source]▶

>>42147443 #

It might sound funny in retrospect, but some of us actually bought VW cars on the assumption that, if biodiesel-powered, it would be more green.

45. anonymouskimmer ◴[15 Nov 24 19:28 UTC] No.42150043{8}[source]▶

>>42148734 #

Ethics are only morality if you spend your entire time in human social contexts. Otherwise morality is a bit larger, and ethics are a special case of group recognized good and bad behaviors.

replies(1): >>42216441 #

46. waffletower ◴[15 Nov 24 19:59 UTC] No.42150309{3}[source]▶

>>42146560 #

Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.

47. dang ◴[15 Nov 24 21:23 UTC] No.42151266[source]▶

>>42145710 (TP) #

We detached this subthread from https://news.ycombinator.com/item?id=42144784.

(Nothing wrong with it! It's just a bit more generic than the original topic.)

48. darkerside ◴[16 Nov 24 12:13 UTC] No.42156023{7}[source]▶

>>42147314 #

Lawful good. Or perhaps even lawful neutral?

What if I make sure to have a drink once a week for the summer with my 18 year old before they go to college because I want them to understand what it's like before they go binge with friends? Is that not ethical?

Speeding to the hospital in an emergency? Lying to Nazis to save a Jew?

Law and ethics are more correlated than some are saying here, but the map is not the territory, and it never will be.

replies(1): >>42158884 #

49. Thorrez ◴[16 Nov 24 18:17 UTC] No.42157977{8}[source]▶

>>42147570 #

I agree if they're negligent they're unethical. But I also think if they're illegal they're generally unethical. In situations where some other right is more important that the law, underage drinking or illegal right on red would be ethical, such as if alcohol is needed as an emergency pain reliever, or a small amount for religious worship, or if you need to drive to the hospital fast in an emergency.

Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal. I'm not contesting in any way that legal things can be unethical. Abortion supporters view it as a human right, and that right is more important than the law.

Right on red, underage drinking, and increasing car emissions aren't human rights. So outside of extenuating circumstances, if they're illegal, I see them as unethical.

replies(1): >>42209418 #

50. Thorrez ◴[16 Nov 24 20:06 UTC] No.42158851{8}[source]▶

>>42148734 #

I'm not outsourcing my morality. There are plenty of actions that are legal that are immoral.

I don't think the government's job is to enforce morality. The government's job is to set up a framework for society to help people get along.

51. Thorrez ◴[16 Nov 24 20:11 UTC] No.42158884{8}[source]▶

>>42156023 #

There can be situations where someone's rights are more important than the law. In that case it's ethical to break the law. Speeding to the hospital and lying to Nazis are cases of that. The drinking with your 18 year old, I'm not sure, maybe.

My point though, is that in general, when there's not a right that outweighs the law, it's unethical to break the law.

52. Thorrez ◴[16 Nov 24 20:14 UTC] No.42158917{6}[source]▶

>>42147666 #

I agree that ethics should inform law. But I live in a society, and have an ethical duty to respect other members of society. And part of that duty is following the laws of society.

53. Thorrez ◴[16 Nov 24 20:17 UTC] No.42158947{6}[source]▶

>>42147439 #

I think you're talking about something different from what sigmoid10 was talking about. sigmoid10 said "manipulate behaviour during those benchmarks". I interpreted that to mean the compiler detects if a benchmark is going on and alters its behavior only then. So this wouldn't impact real life use cases.

54. linksnapzz ◴[21 Nov 24 04:09 UTC] No.42201013{7}[source]▶

>>42148188 #

40x infinitesimal is still...infinitesimal.

55. chefandy ◴[21 Nov 24 22:38 UTC] No.42209418{9}[source]▶

>>42157977 #

> Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal.

So it doesn't matter that a very small percentage of the world's population believes life begins at conception, it's still unethical? Or is everything unethical that anyone thinks is unethical across the board, regardless of the other factors? Since some vegans believe eating honey is unethical, does that mean it's unethical for everybody, or would it only be unethical if it was illegal?

In autocracies where all newly married couples were legally compelled to allow the local lord to rape the bride before they consummated the marriage, avoiding that would be unethical?

Were the sit-in protest of the American civil rights era unethical? They were illegal.

Was it unethical to hide people from the Nazis when they were search for people to exterminate? It was against the law.

Was apartheid ethical? It was the law.

Was slavery ethical? It was the law.

Were the jim crow laws ethical?

I have to say, I just fundamentally don't understand your faith in the infallibility of humanity's leaders and governing structures. Do I think it's generally a good idea to follow the law? Of course. But there are so very many laws that are clearly unethical. I think your conflating legal correctness with mores with core foundational ethics is rather strange.

56. chefandy ◴[22 Nov 24 18:57 UTC] No.42216441{9}[source]▶

>>42150043 #

I don't think "ethics" implies group recognition, though-- I'd call those principles mores.

↑