I had a problem where I used GPT-4o to help me with inventory management, something a 5th grade kid could handle, and it kept screwing up values for a list of ~50 components. I ended up spending more time trying to get it to properly parse the input audio (I read off the counts as I moved through inventory bins) than if I had just done it manually.
On the other hand, I have had good success with having it write simple programs and apps. So YMMV quite a lot more than with a regular person.
This generally means that for a task like the one you are doing, you need signposts in the data, like minute markers or something similar, that it can process serially.
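To make the signpost idea concrete, here is a minimal sketch, assuming the transcription step stamps the text with [MM:SS] markers (the marker format and function name are just illustrative). Each minute's text becomes its own small chunk that can be handed to the model one at a time instead of as one long recording:

```python
import re

MARKER = re.compile(r"\[(\d{2}):(\d{2})\]")  # assumed [MM:SS] signpost format

def chunk_by_minute(transcript: str) -> dict[int, list[str]]:
    """Group transcript text under the minute of the marker that precedes it.

    Text before the first marker is dropped; the point is just that each
    chunk is small enough for the model to handle on its own.
    """
    chunks: dict[int, list[str]] = {}
    matches = list(MARKER.finditer(transcript))
    for i, m in enumerate(matches):
        minute = int(m.group(1))
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = transcript[start:end].strip()
        if text:
            chunks.setdefault(minute, []).append(text)
    return chunks
```

You then prompt the model once per chunk rather than asking it to hold the whole inventory session in its head.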
The same limitation means there are operations that are VERY HARD for the model, like ranking/sorting: it has to attend to everything to find the next-biggest item, and so on. Current models struggle with this.
The point is that the ways in which a human fails are completely different from how LLMs fail, and they vary between people, whereas the failure modes for LLMs are all fairly identical regardless of the model. Ask an LLM to draw you a wine glass filled to the brim: it will keep insisting it has, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and then output the exact same drawing again. Most people would not fail at the task in that way.
I by no means have a 'maximal' position. I have said that they exceed the intelligence and ability of the vast majority of the human populace when it comes to their singular sense and action (ingesting language and outputting language). I fully stand by that, because it's true. I've not claimed that they exceed everyone's intelligence in every area. However, their ability to synthesize wildly different fields is well beyond most humans' ability. Yes, I do believe we've crossed the tipping point. As it is, tipping points like this are not noticeable except in retrospect.
> The point is that the ways in which a human fails are completely different from how LLMs fail, and they vary between people, whereas the failure modes for LLMs are all fairly identical
I disagree with the idea that human failure modes are different between people. I think this is the result of not thinking at a high enough level. Human failure modes are often very similar. Drama authors make a living off exploring human failure modes, and there's a reason why they say there are no new stories.
I agree that human and LLM failure modes are different, but that's to be expected.
> regardless of the model
As far as I'm aware, all LLMs in common use today use a variant of the transformer. Transformers have very different pitfalls compared to RNNs (RNNs are particularly bad at recall, for example).
> Ask an LLM to draw you a wine glass filled to the brim: it will keep insisting it has, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and then output the exact same drawing again. Most people would not fail at the task in that way.
Most people can't draw very well anyway, so this is just proving my point.
Comparison-based ranking/sorting is O(n log n) no matter what. Given that a transformer does a fixed amount of computation per forward pass before we 'force' it to output an answer, there must be an M such that beyond that length it cannot reliably sort a list. This MUST be the case, and it can only be solved by running the model some indeterminate number of times, but I don't believe we currently have any architecture to do that.
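Here is a rough sketch of the kind of outer loop I mean, using the OpenAI chat API purely as an example (the model name, prompt, and function names are illustrative). The sorting work happens in the Python loop that calls the model O(n) times, not inside any single forward pass:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_pick_max(items: list[str]) -> str:
    """One bounded sub-task per call: ask the model only for the largest item."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": ("Reply with exactly one item from this list, the one with "
                        "the largest value, and nothing else: " + ", ".join(items)),
        }],
    )
    return resp.choices[0].message.content.strip()

def llm_selection_sort(items: list[str]) -> list[str]:
    """Sort by calling the model repeatedly; the loop, not the model, supplies the O(n log n)-ish work."""
    remaining, ordered = list(items), []
    while remaining:
        best = llm_pick_max(remaining)
        if best not in remaining:  # the model named something not in the list
            raise ValueError(f"model returned an unknown item: {best!r}")
        remaining.remove(best)
        ordered.append(best)
    return list(reversed(ordered))  # ascending order
```

The indeterminate number of runs lives in the `while` loop outside the model; nothing in the architecture itself decides how many passes are needed.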
Note that humans have the same limitation. If you give humans a time limit, there is a maximum number of things they will be able to sort reliably in that time.
And you're proving my point. The ways in which people would fail to draw the wine glass are different from how the LLM fails. The vast majority of people would fail to produce a photorealistic facsimile. But the vast majority of people would meet the requirement of drawing it filled to the brim. The LLMs absolutely succeed at the quality of the drawing but absolutely fail at meeting human specifications and expectations. Generously, you can say it's a different kind of intelligence. But saying it's more intelligent than humans requires you to use a drastically different axis akin to the one you'd use saying that computers are smarter than humans because they can add two numbers more quickly.
> But the vast majority of people would meet the requirement of drawing it filled to the brim.
But both are failures, right? It's just a cognitive bias that we don't expect artistic ability of most people.
> But saying it's more intelligent than humans requires you to use a drastically different axis
I'm not going to rehash this here, but as I said elsewhere in this thread, intelligences are different. There's no one metric, but for many common human tasks, the ability of the LLMs surpasses humans.
> saying that computers are smarter than humans because they can add two numbers more quickly.
This is where I disagree. Unlike a traditional program, both humans and LLMs can take unstructured input and instruction. Yes, they can both fail, and they fail differently (or succeed in different ways), but there is a wide gulf between the sort of structured computation a traditional program does and what an LLM does.
No, I'd say very different failures. The LLM is failing at reasoning and understanding, whereas people are failing at training. Humans can fix the training part by simply practicing the task repeatedly. LLMs can't fix the understanding part because it's a fundamental flaw in the design. It's like categorizing a chimp's inability to understand logical reasoning as "cognitive bias" - no, it's a much more structural problem.
> intelligences are different. There's no one metric, but for many common human tasks, the ability of the LLMs surpasses humans
There isn't one metric, and yes, LLMs surpass humans on various tasks. But we haven't been able to establish any evidence that the mechanism they operate by is intelligence. It's certainly the closest we've come to building something artificial that approximates it to a high degree in some cases. But there's still no indication that this isn't just a general-purpose ML algorithm, or that it has anything approaching human intelligence or sentience. It can mimic various human skills related to generative intelligence (writing and drawing), but it's less clear it can mimic anything else.
> This is where I disagree. Unlike a traditional program, both humans and LLMs can take unstructured input and instruction
That is true, but it's a huge claim and a big leap to then say that anything taking unstructured input and instruction is demonstrating intelligence, especially when it fails to execute the requested instructions correctly no matter how much correction you provide (as demonstrated by the wine glass problem and many other similar failure points).
There's reason to believe there is a difference, both from a power-consumption perspective and from the fact that transformers do not self-learn from additional input. Humans meld short-term and long-term learning, whereas things like ChatGPT bolt on "memories", which are just factoids stored in a RAG setup and not something the transformer learns as new training data.
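To make that last point concrete, here is roughly what a bolted-on memory looks like. This is a minimal sketch using the OpenAI SDK for illustration, not a claim about how ChatGPT actually implements it; the point is that the factoids live entirely outside the network and just get pasted into the prompt at inference time:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
memories: list[tuple[np.ndarray, str]] = []  # (embedding, factoid) pairs stored outside the model

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def remember(factoid: str) -> None:
    """'Learning' here is just appending to an external store; no weights change."""
    memories.append((embed(factoid), factoid))

def answer(question: str, k: int = 3) -> str:
    """Retrieve the k most similar factoids and paste them into the prompt."""
    q = embed(question)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    top = sorted(memories, key=lambda m: cosine(q, m[0]), reverse=True)[:k]
    context = "\n".join(fact for _, fact in top)
    resp = client.chat.completions.create(
        model="gpt-4o",  # the transformer itself is identical before and after every "memory"
        messages=[{"role": "user",
                   "content": f"Known facts about the user:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```

You can call `remember()` a hundred times and the model is exactly as it was; only the retrieval store and the prompt context grow, which is a very different thing from the model learning.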