
688 points crescit_eundo | 3 comments
swiftcoder No.42144784
I feel like the article neglects one obvious possibility: that OpenAI decided chess was a benchmark worth "winning", special-cased chess within gpt-3.5-turbo-instruct, and then neglected to carry that special case over to follow-up models once it was no longer generating sustained press coverage.
1. amelius No.42145619
To be fair, they say

> Theory 2: GPT-3.5-instruct was trained on more chess games.

2. AstralStorm No.42146129
If that were the case, pumping big Llama chock full of chess games would produce good results. It didn't.

The only way that theory could be true is if the model recognized the game and replayed the answer from memory.
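
(For concreteness, a minimal sketch of the kind of "pump Llama full of chess games" fine-tune being discussed: a HuggingFace-style causal-LM run over PGN movetext. The base model name, the games.txt file, and the hyperparameters are illustrative assumptions, not the setup used in the article or by anyone in this thread.)

    # Hypothetical sketch: continue-pretraining a Llama-style model on chess game transcripts.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # games.txt (assumed): one PGN movetext per line, e.g. "1. e4 e5 2. Nf3 Nc6 ..."
    dataset = load_dataset("text", data_files={"train": "games.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama-chess",
                               per_device_train_batch_size=4,
                               num_train_epochs=1,
                               learning_rate=2e-5),
        train_dataset=tokenized,
        # mlm=False gives plain next-token (causal) language modeling labels
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

Whether such a run yields legal, strong play or mere memorization of known games is exactly the question the thread is raising.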

3. yorwba No.42146631
Do you have a link to the results from fine-tuning a Llama model on chess? How do they compare to the base models in the article here?