
340 points | agomez314 | 1 comment
fwlr:
I commend them on pushing back on LLM hype and hope their book gets published in a timely manner… but damn I’m also glad that I am not the one writing it, since I fear many of its claims will go the way of IBM President Thomas Watson’s infamous 1940s quote that “there is a world market for about five computers”.

The theme that LLMs reproduce knowledge from their training data rather than reason about it seems like one argument that will end up wrong pretty soon.

When given the prompt “Which is heavier, one pound of feathers or two pounds of feathers?”, GPT3.5 gives a bizarre answer: “One pound of feathers and two pounds of feathers both weigh the same amount, which is two pounds.” Presumably this is because circa 2016-2017 there was a large internet discussion of the riddle “which weighs more, a pound of feathers or a pound of steel”, and text from this discussion found its way into the training data for the model.

I see no reason why the training data would have changed to substantially exclude that discussion, and yet here is GPT-4: “Two pounds of feathers are heavier than one pound of feathers.” Improvements to the model appear to improve its ability to reason from the training data rather than merely reproduce it.
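
(If you want to rerun this comparison yourself, here is a minimal sketch using the OpenAI Python client; the exact model names, client version, and availability are my assumptions, so adjust them to whatever you actually have access to.)

    # Minimal sketch: send the same riddle to two models and compare answers.
    # Assumes the openai Python package (>= 1.0) and an OPENAI_API_KEY env var;
    # model names are illustrative and may differ from what is available.
    from openai import OpenAI

    client = OpenAI()
    prompt = "Which is heavier, one pound of feathers or two pounds of feathers?"

    for model in ["gpt-3.5-turbo", "gpt-4"]:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep the comparison as deterministic as possible
        )
        print(model, "->", response.choices[0].message.content)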

The theme that AI won’t replace e.g. lawyers because it is more knowledge base than reasoning engine also reminds me of early opinions in computer chess, which held that computers were tactics solvers (short-term look-ahead over a few moves to avoid forks and traps) rather than strategic planners (long-term construction of multi-piece attacks, protecting small advantages and growing them into large ones over dozens of moves). With the benefit of hindsight, we saw that more of strategy was just tactics in disguise than we had thought, and that increasing compute could produce genuine strategic play besides.

Separately, there is another theme I see in their writing, and also in some of the comments here: that humans passing standardized tests are doing something fundamentally different from LLMs passing standardized tests. The only thing that’s ‘uniquely human’ is being human; everything else is outputs from a black box. Arguments that ‘what’s inside the black box matters’ are risky, because the outputs gradually converge to indistinguishability; there’s no bright line to step off that train, and pretty soon you end up like the person described in Borretti’s And Yet It Understands:

There is a species of denialist for whom no evidence whatever will convince them that a computer is doing anything other than shuffling symbols without understanding them, because “Concepts” and “Ideas” are exclusive to humans (they live in the Leibniz organ, presumably, where they pupate from the black bile). … [These people are] so committed to human chauvinism [that] they will soon start denying their own sentience because their brains are made of flesh and not Chomsky production rules.

https://borretti.me/article/and-yet-it-understands

Reply from calf:
Even if the outputs are indistinguishable, there could be different internal algorithms with different computational efficiencies. In a way, that is what these authors, Chomsky, and probably other skeptics are concerned about: the black box lets the other faction of scientists off the hook. They can just claim ChatGPT is a bona fide model, but because it's a black box we don't know how it learned English. We don't even know how ChatGPT learned the grammars of C++ and other programming languages, or whether its internal learned algorithm is like or unlike the context-free grammar formalism we use when writing a compiler, i.e. a grammar that is mathematically well defined and yet learnable by a neural network. So it's an interesting and problematic debate.
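
To make concrete what I mean by a mathematically well-defined grammar formalism, here is a toy sketch (my own illustrative grammar, not any real language's spec) of the kind of context-free grammar and recursive-descent recognizer one writes when building a compiler:

    # Toy grammar (illustrative): expr   -> term ('+' term)*
    #                             term   -> factor ('*' factor)*
    #                             factor -> DIGIT | '(' expr ')'
    def accepts(s: str) -> bool:
        """Return True iff s is derivable from the grammar above."""
        pos = 0

        def peek():
            return s[pos] if pos < len(s) else ""

        def eat(ch):
            nonlocal pos
            if peek() != ch:
                raise ValueError(f"expected {ch!r} at position {pos}")
            pos += 1

        def expr():
            term()
            while peek() == "+":
                eat("+")
                term()

        def term():
            factor()
            while peek() == "*":
                eat("*")
                factor()

        def factor():
            if peek() == "(":
                eat("(")
                expr()
                eat(")")
            elif peek().isdigit():
                eat(peek())
            else:
                raise ValueError(f"expected digit or '(' at position {pos}")

        try:
            expr()
            return pos == len(s)  # reject trailing junk
        except ValueError:
            return False

    print(accepts("(1+2)*3"))  # True
    print(accepts("(1+2*3"))   # False

The point is just that membership in such a language is crisply decidable, which is exactly what makes the question "did the network learn this, or something weaker?" well posed.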

I think it would be an interesting computer science experiment if the scientists behind ChatGPT showed that the machine could learn a programming grammar simply by brute force. They could then formally prove that the trained network eventually contains the actual grammar formalism that defines the programming language. By restricting the domain like this, they could shed some light on how much the thing actually learns the language completely vs. by "super-autocompletion". With a programming language there's no excuse for not learning the formalism; with English, the grammar is not practically definable, maybe not even definable in principle.
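
As a rough sketch of what I have in mind (everything here is hypothetical, and the learner itself is left as a stand-in): generate labeled strings from a small context-free grammar, train any black-box sequence classifier on them, then test it on strings longer than anything it saw during training. Systematic failure on the longer strings would look like "super-autocompletion"; systematic success would suggest the grammar itself was learned.

    # Hypothetical experiment harness; the toy grammar and all names are mine.
    import random

    GRAMMAR = {  # illustrative expression grammar
        "expr": [["term"], ["term", "+", "expr"]],
        "term": [["factor"], ["factor", "*", "term"]],
        "factor": [["NUM"], ["(", "expr", ")"]],
    }

    def derive(symbol="expr", depth=6):
        """Randomly expand a nonterminal into a terminal string."""
        if symbol == "NUM":
            return str(random.randint(0, 9))
        if symbol not in GRAMMAR:
            return symbol  # terminal such as '+', '(' or ')'
        # once depth runs out, only allow the first (shortest) rule so the
        # recursion is guaranteed to terminate
        rules = GRAMMAR[symbol] if depth > 0 else GRAMMAR[symbol][:1]
        return "".join(derive(s, depth - 1) for s in random.choice(rules))

    def corrupt(s):
        """Insert a stray symbol to get an (almost certainly) invalid string."""
        i = random.randrange(len(s))
        return s[:i] + random.choice("+*()") + s[i:]

    positives = [derive() for _ in range(1000)]
    negatives = [corrupt(p) for p in positives]
    train = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    # A learner (LSTM, transformer, whatever) would be fit on `train` and then
    # evaluated on longer held-out strings, e.g. derive(depth=12), to see
    # whether it generalizes the way the grammar itself would.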