LLMs still hallucinate and make simple mistakes.
And the progress seams to be in the benchmarks only
> And the progress seams to be in the benchmarks only
This seems to be mostly wrong given peoples' reactions to e.g. o3 that was released today. Either way, progress having stalled for the last year doesn't seem that big considering how much progress there has been for the previous 15-20 years.
How do you know they are possible to do today? Errors gets much worse at scale, especially when systems starts to depend on each other, so it is hard to say what can be automated and not.
Like if you have a process A->B, automating A might be fine as long as a human does B and vice versa, but automating both could not be.