Most active commenters
  • scott_w(6)
  • Jerrrrrrry(3)
  • cruffle_duffle(3)

688 points crescit_eundo | 74 comments
1. swiftcoder ◴[] No.42144784[source]
I feel like the article neglects one obvious possibility: that OpenAI decided chess was a benchmark worth "winning", special-cased chess within gpt-3.5-turbo-instruct, and then neglected to carry that special case over to follow-up models since it wasn't generating sustained press coverage.
replies(8): >>42145306 #>>42145352 #>>42145619 #>>42145811 #>>42145883 #>>42146777 #>>42148148 #>>42151081 #
2. INTPenis ◴[] No.42145306[source]
Of course it's a benchmark worth winning; it has been since Watson, and even before that, with the Mechanical Turk.
3. dmurray ◴[] No.42145352[source]
This seems quite likely to me. But did they special-case it by reinforcement-training it into the LLM (which would be extremely interesting in terms of how they did it and what its internal representation looks like), or is it that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?
replies(1): >>42145408 #
4. shaky-carrousel ◴[] No.42145408[source]
That's easy to test: invent a new chess variant and see how the model does.
replies(3): >>42145466 #>>42145557 #>>42146160 #
5. gliptic ◴[] No.42145466{3}[source]
Both an LLM and Stockfish would fail that test.
replies(1): >>42146130 #
6. andy_ppp ◴[] No.42145557{3}[source]
You're imagining that LLMs do more than regurgitate and recombine things they have seen before. A new variant would not be in the dataset, so it would not be understood. In fact, this is quite a good way to show that LLMs are NOT thinking or understanding anything in the way we understand it.
replies(2): >>42145905 #>>42147218 #
7. amelius ◴[] No.42145619[source]
To be fair, they say

> Theory 2: GPT-3.5-instruct was trained on more chess games.

replies(1): >>42146129 #
8. scott_w ◴[] No.42145811[source]
I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.
replies(5): >>42146086 #>>42146152 #>>42146383 #>>42146415 #>>42156785 #
9. bambax ◴[] No.42145883[source]
Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.

That's the problem with closed models, we can never know what they're doing.

10. shaky-carrousel ◴[] No.42145905{4}[source]
Yes, that's how you can really tell if the model is doing real thinking and not just recombining things. If it can correctly play a novel game, then it's doing more than that.
replies(3): >>42146014 #>>42146022 #>>42146190 #
11. dwighttk ◴[] No.42146014{5}[source]
No LLM is doing any thinking.
replies(1): >>42146320 #
12. jahnu ◴[] No.42146022{5}[source]
I wonder what minimal amount of change would qualify as novel?

"Chess but white and black swap their knights" for example?

replies(1): >>42147158 #
13. fires10 ◴[] No.42146086[source]
Recognize and hand over to a specialist engine? That might be useful for AI. Maybe I am missing something.
replies(5): >>42146145 #>>42146293 #>>42146329 #>>42147558 #>>42151536 #
14. AstralStorm ◴[] No.42146129[source]
If that were the case, pumping big Llama chock full of chess games would produce good results. It didn't.

The only way it could be true is if that model recognized and replayed the answer to the game from memory.

replies(1): >>42146631 #
15. delusional ◴[] No.42146130{4}[source]
Nobody is claiming that Stockfish is learning generalizable concepts that can one day meaningfully replace people in value creating work.
replies(1): >>42146756 #
16. worewood ◴[] No.42146145{3}[source]
It's because this has been standard practice since the early days; there's nothing newsworthy in this at all.
17. Kiro ◴[] No.42146152[source]
That's something completely different from what the OP suggests, and it would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).
replies(3): >>42146324 #>>42147204 #>>42151029 #
18. dmurray ◴[] No.42146160{3}[source]
In both scenarios it would perform poorly on that.

If the chess specialization was done through reinforcement learning, that's not going to transfer to your new variant, any more than access to Stockfish would help it.

19. timdiggerm ◴[] No.42146190{5}[source]
By that standard (and it is a good standard), none of these "AI" things are doing any thinking
replies(1): >>42147408 #
20. generic92034 ◴[] No.42146293{3}[source]
How do you think AIs are (correctly) solving simple mathematical questions they have not been trained on directly? They hand them over to a specialist maths engine.
replies(1): >>42149781 #
21. selestify ◴[] No.42146320{6}[source]
How do you define thinking?
replies(2): >>42146586 #>>42151638 #
22. nerdponx ◴[] No.42146324{3}[source]
Ironically it's probably a lot closer to what a super-human AGI would look like in practice, compared to just an LLM alone.
replies(2): >>42146675 #>>42149673 #
23. nerdponx ◴[] No.42146329{3}[source]
It is and would be useful, but it would also be quite a big lie to the public, more importantly to paying customers, and even more importantly to investors.
replies(1): >>42148826 #
24. gamerDude ◴[] No.42146383[source]
This is exactly what I feel AI needs: a manager AI that hands things off to specialized, more deterministic algorithms/machines.
replies(4): >>42146397 #>>42147292 #>>42150449 #>>42152158 #
25. criley2 ◴[] No.42146397{3}[source]
Basically what Wolfram Alpha rolled out 15 years ago.

It was impressive then, too.

replies(1): >>42150365 #
26. antifa ◴[] No.42146415[source]
TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For a complicated math equation, for example, using a calculator is just smarter than doing it in your head.
replies(1): >>42146582 #
27. PittleyDunkin ◴[] No.42146582{3}[source]
We already have the chess "calculator", though. It's called Stockfish. I don't know why you'd ask a dictionary how to solve a math problem.
replies(4): >>42146684 #>>42147106 #>>42149986 #>>42162440 #
28. antononcube ◴[] No.42146586{7}[source]
Being fast at doing linear algebra computations. (Is there any other kind?!)
29. yorwba ◴[] No.42146631{3}[source]
Do you have a link to the results from fine-tuning a Llama model on chess? How do they compare to the base models in the article here?
30. sanderjd ◴[] No.42146675{4}[source]
Right. To me, this is the "agency" thing, that I still feel like is somewhat missing in contemporary AI, despite all the focus on "agents".

If I tell an "agent", whether human or artificial, to win at chess, it is a good decision for that agent to decide to delegate that task to a system that is good at chess. This would be obvious to a human agent, so presumably it should be obvious to an AI as well.

This isn't useful for AI research, I suppose, but it makes the product more useful as a tool.

(This may all be a good thing, as giving AIs true agency seems scary.)

replies(1): >>42147515 #
31. iamacyborg ◴[] No.42146684{4}[source]
People ask LLMs to do all sorts of things they're not good at.
32. droopyEyelids ◴[] No.42146756{5}[source]
The point was that such a question could not be used to tell whether the LLM was calling a chess engine.
replies(1): >>42147700 #
33. jackcviers3 ◴[] No.42146777[source]
Why couldn't they add a tool that literally calls Stockfish or a chess AI behind the scenes with function calling, and buffer the request before sending it back to the endpoint output interface?

As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat AI, pass it through a moderation filter, and you might get good output from it with very little latency added.
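
Nothing is known publicly about OpenAI's actual stack, but a sketch shows how cheap such a shim would be to build. This assumes the python-chess library and a stockfish binary on $PATH; the detection heuristic and the call_llm stub are invented for illustration:

  import re
  import chess
  import chess.engine

  # SAN-ish move matcher: "e4", "Nf3", "exd5", "O-O", "a8=Q", ...
  SAN_MOVE = re.compile(
      r"\b(?:[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?|O-O(?:-O)?)\b")

  def looks_like_chess(prompt: str) -> bool:
      # Crude heuristic: the prompt contains a run of SAN-style moves.
      return len(SAN_MOVE.findall(prompt)) >= 4

  def engine_move(san_moves) -> str:
      # Replay the transcript, then ask Stockfish for the next move.
      board = chess.Board()
      for san in san_moves:
          board.push_san(san)
      with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
          result = engine.play(board, chess.engine.Limit(time=0.1))
      return board.san(result.move)

  def call_llm(prompt: str) -> str:
      raise NotImplementedError("stand-in for the ordinary LLM path")

  def complete(prompt: str) -> str:
      if looks_like_chess(prompt):
          return engine_move(SAN_MOVE.findall(prompt))
      return call_llm(prompt)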

34. the_af ◴[] No.42147106{4}[source]
A generalist AI with a "chatty" interface that delegates to specialized modules for specific problem-solving seems like a good system to me.

"It looks like you're writing a letter" ;)

replies(1): >>42147436 #
35. the_af ◴[] No.42147158{6}[source]
I wonder what would happen with a game that is mostly chess (or chess with truly minimal variations) but with all the names changed: pieces, moves, "check", and so on. The algebraic notation is also replaced with something else, so it cannot be pattern-matched against the training data. Then you list the rules (which are mostly the same as chess).

None of these changes are explained to the LLM, so if it can tell it's still chess, it must deduce this on its own.

Would any LLM be able to play at a decent level?
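
Generating such a test mechanically would be straightforward. A sketch, with an arbitrary substitution vocabulary (the rules text shown to the model would be rewritten in the same vocabulary):

  # Re-encode SAN so games can't be pattern-matched against training data:
  # pieces get new letters, files become digits, ranks become letters.
  PIECES = {"K": "Z", "Q": "Y", "R": "X", "B": "W", "N": "V"}
  FILES = {f: str(i + 1) for i, f in enumerate("abcdefgh")}  # a->1 ... h->8
  RANKS = {r: "ABCDEFGH"[int(r) - 1] for r in "12345678"}    # 1->A ... 8->H

  def obfuscate(san: str) -> str:
      return "".join(
          PIECES.get(c) or FILES.get(c) or RANKS.get(c) or c for c in san)

  print(obfuscate("Nf3"))   # V6C
  print(obfuscate("exd5"))  # 5x4E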

replies(1): >>42152352 #
36. empath75 ◴[] No.42147204{3}[source]
The point of creating a service like this is for it to be useful, and if recognizing and handing off tasks to specialized agents isn't useful, I don't know what is.
replies(1): >>42147547 #
37. empath75 ◴[] No.42147218{4}[source]
You say this quite confidently, but LLMs do generalize somewhat.
38. spiderfarmer ◴[] No.42147292{3}[source]
Multi-agent LLMs are already a thing.
replies(1): >>42148751 #
39. Jerrrrrrry ◴[] No.42147408{6}[source]
Musical goalposts, gotta love it.

These LLMs just exhibited agency.

Swallow your pride.

replies(1): >>42147976 #
40. datadrivenangel ◴[] No.42147436{5}[source]
Let's clip this in the bud before it grows wings.
replies(1): >>42150584 #
41. scott_w ◴[] No.42147515{5}[source]
If this were part of the offering: "we can recognise requests and delegate them to appropriate systems," I'd understand and be somewhat impressed, but the marketing hype leaves this out.

Most likely because they want people to think the system is better than it is for hype purposes.

I should temper how impressed I am: it only counts if it's doing this dynamically. Hardcoding recognition of chess moves isn't exactly a difficult trick to pull, given there are like 3 standard formats…

replies(2): >>42148468 #>>42149134 #
42. scott_w ◴[] No.42147547{4}[source]
If I were sold a product that can generically solve problems, I'd feel a bit ripped off if I were told after purchase that I need to build my own problem solver and a way to recognise when it's needed…
replies(1): >>42151049 #
43. scott_w ◴[] No.42147558{3}[source]
If I were sold a general AI problem-solving system, I'd feel ripped off if I learned that I needed to build my own problem solver and hook it up after I'd paid my money…
44. delusional ◴[] No.42147700{6}[source]
Ah okay, I missed that.
45. samatman ◴[] No.42147976{7}[source]
"Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.
replies(1): >>42148394 #
46. oezi ◴[] No.42148148[source]
Maybe they even delegate it to a chess engine internally via the tool use and the LLM uses that.
47. Jerrrrrrry ◴[] No.42148394{8}[source]

  > "thinking" vs "just recombining things"
If there is a difference, and LLM's can do one but not the other...

  >By that standard (and it is a good standard), none of these "AI" things are doing any thinking

  >"Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.

Then what the fuck are they doing?

Learning is thinking, reasoning, what have you.

Move goalposts, re-define words, it won't matter.

48. Kiro ◴[] No.42148468{6}[source]
You're speaking like it's confirmed. Do you have any proof?

Again, the comment you initially responded to was not talking about faking it by using a chess engine. You were the one introducing that theory.

replies(1): >>42150704 #
49. nine_k ◴[] No.42148751{4}[source]
Somehow they're not in the limelight, and lack a well-known open-source runner implementation (like llama.cpp).

Given the potential, they should be winning hands down; where's that?

50. anon84873628 ◴[] No.42148826{4}[source]
The problem is simply that the company has not been open about how it works, so we're all just speculating here.
51. sanderjd ◴[] No.42149134{6}[source]
Fair!
52. dartos ◴[] No.42149673{4}[source]
So… we’re at expert systems again?

That’s how the AI winter started last time.

replies(1): >>42157158 #
53. internetter ◴[] No.42149781{4}[source]
This is a relatively recent development (<3 months), at least for OpenAI, where the model will generate code to solve math problems and use the response.
replies(1): >>42151065 #
54. mkipper ◴[] No.42149986{4}[source]
Chess might not be a great example, given that most people interested in analyzing chess moves probably know that chess engines exist. But it's easy to find examples where this approach would be very helpful.

If I'm an undergrad doing a math assignment and want to check an answer, I may have no idea that symbolic algebra tools exist or how to use them. But if an all-purpose LLM gets a screenshot of a math equation and knows that its best option is to pass it along to one of those tools, that's valuable to me even if it isn't valuable to a mathematician, who would have just cut out the LLM middleman and gone straight to the solver.

There are probably a billion examples like this. I'd imagine lots of people are clueless that software exists which can help them with some problem they have, so an LLM would be helpful for discovery even if it's just acting as a pass-through.
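
For the math case the specialist tool already exists in the Python ecosystem. A sketch of the hand-off being described, using SymPy (the routing that decides to call it is the hypothetical part):

  import sympy as sp

  # The step the LLM should delegate rather than "do in its head".
  x = sp.symbols("x")
  solutions = sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x)
  print(solutions)  # [2, 3]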

replies(1): >>42151710 #
55. waffletower ◴[] No.42150365{4}[source]
It is good to see other people buttressing Stephen Wolfram's ego. It is extraordinarily heavy work and Stephen can't handle it all by himself.
56. waffletower ◴[] No.42150449{3}[source]
While deterministic components may be a left-brain default, there is no reason such delegated services couldn't be more specialized ANN models themselves. It would most likely vastly improve performance if they were evaluated in the same memory space using tensor connectivity. In the specific case of chess, it is helpful to remember that AlphaZero utilizes ANNs as well.
57. nuancebydefault ◴[] No.42150584{6}[source]
It looks like you're having déjà vu.
58. scott_w ◴[] No.42150704{7}[source]
No, I don’t have proof and I never suggested I did. Yes, it’s 100% hypothetical but I assumed everyone engaging with me understood that.
59. cruffle_duffle ◴[] No.42151029{3}[source]
If they came out and said it, I don't see the problem. LLMs aren't the solution for a wide range of problems. They are a new tool, but not everything is a nail.

I mean, it already hands off a wide range of tasks to Python… this would be no different.

60. cruffle_duffle ◴[] No.42151049{5}[source]
But it already hands off plenty of stuff to things like Python. How would this be any different?
replies(1): >>42154898 #
61. cruffle_duffle ◴[] No.42151065{5}[source]
They’ve been doing that a lot longer than three months. ChatGPT has been handing stuff off to Python for a very long time. At least for my paid account, anyway.
62. vimbtw ◴[] No.42151081[source]
This is exactly it. Here’s the pull request where chess evals were added: https://github.com/openai/evals/pull/45.
63. skydhash ◴[] No.42151536{3}[source]
Wasn't that the basis of computing and technology in general? Here is one tedious thing; let's have a specific tool that handles it instead of wasting time and effort. The fact is that properly using the tool takes training, and most current AI marketing hypes the idea that you don't need that. Instead, hand the problem over to a GPT and it will "magically" solve it.
64. landryraccoon ◴[] No.42151638{7}[source]
Making the OP feel threatened/emotionally attached/both enough to call the language model a rival / companion / peer instead of a tool.
replies(1): >>42176541 #
65. mabster ◴[] No.42151710{5}[source]
Even knowing that the software exists isn't enough. You have to learn how to use the thing.
66. bigiain ◴[] No.42152158{3}[source]
Next thing, the "manager AIs" start stack-ranking the specialized "worker AIs".

And the worker AIs "evolve" to meet/exceed expectations only on tasks directly contributing to the KPIs the manager AIs measure, via the mechanism of discarding the "less fit to exceed KPIs".

And some of the worker AIs that are trained on recent/polluted internet happen to spit out prompt-injection attacks that work against the manager AIs' stack-ranking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other non-cancerous manager AIs, and die or get killed off by the VCs paying for their datacenters.

Competing manager AIs get training, perhaps on newer HN posts discussing this emergent behavior of worker AIs, and start to down-rank any exceptionally performing worker AIs. The overall trend towards mediocrity becomes inevitable.

Some greybeard writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real-world task, while running on a 10-year-old laptop instead of a cluster of nuclear-powered AI datacenters all consuming a city's worth of fresh drinking water.

Nobody in powerful positions cares. Humanity dies.

67. jahnu ◴[] No.42152352{7}[source]
Nice. Even the tiniest rule, I strongly suspect, would throw off pattern matching. “Every second move, swap the name of the piece you move to the last piece you moved.”
68. scott_w ◴[] No.42154898{6}[source]
If you mean “uses bin/python to run Python code it wrote” then that’s a bit different to “recognises chess moves and feeds them to Stockfish.”

If a human said they could code, you don’t expect them to somehow turn into a Python interpreter and execute it in their brain. If a human said they could play chess, I’d raise an eyebrow if they just played the moves Stockfish gave them against me.

69. kazinator ◴[] No.42156785[source]
That's not much different from a compiler being rigged to recognize a specific benchmark program and spit out a canned optimization.
replies(1): >>42171356 #
70. kadoban ◴[] No.42157158{5}[source]
What is an "expert system" to you? In AI they're just a series of if-then statements that encode certain rules. What non-trivial part of an LLM reaching out to a chess AI does that describe?
replies(1): >>42160230 #
71. dartos ◴[] No.42160230{6}[source]
The initial LLM acts as an intent-detection switch.

To personify the LLM way too much:

It sees that a prompt of some kind wants to play chess.

Knowing this, it looks at its bag of "tools" and sees a chess tool. It then generates a response which eventually causes a call to a chess AI (or just a chess program, potentially) which does further processing.

The first LLM acts as a ton of if-then statements, but automatically generated (or discovered by brute force) through training.

You still need discrete parts for this system: some communication protocol, an intent-detection step, a chess execution step, etc…

I don’t see how that differs from a classic expert system, other than that the if statements are handled by a statistical model.
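
A toy sketch of the comparison, with all names hypothetical; the structure is identical, and only the author of the dispatch rules changes:

  # Classic expert system: a human writes the dispatch rules by hand.
  def route_classic(prompt: str) -> str:
      if "chess" in prompt.lower():
          return "chess_engine"
      if "solve" in prompt.lower():
          return "math_solver"
      return "general_llm"

  # LLM-as-router: the same dispatch, except the "if statements" are a
  # statistical model choosing among declared tools.
  TOOLS = ["chess_engine", "math_solver", "general_llm"]

  def llm_pick_tool(prompt: str, tools) -> str:
      raise NotImplementedError("stand-in for the model's tool choice")

  def route_llm(prompt: str) -> str:
      return llm_pick_tool(prompt, TOOLS)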

72. threatripper ◴[] No.42162440{4}[source]
You take a picture of a chess board and send it to ChatGPT and it replies with the current evaluation and the best move/strategy for black and white.
73. Peteragain ◴[] No.42171356{3}[source]
…or a Volkswagen recognising an emissions test and turning off power mode…
74. Jerrrrrrry ◴[] No.42176541{8}[source]
Lolol. It's a chess thread, say it.

We are pawns, hoping to be maybe a Rook to the King by endgame.

Some think we can promote our pawns to Queens to match.

Luckily, the Jester muses!