I had a problem where I used GPT-4o to help me with inventory management, something a 5th grade kid could handle, and it kept screwing up values for a list of ~50 components. I ended up spending more time trying to get it to properly parse the input audio (I read off the counts as I moved through inventory bins) than if I had just done it manually.
On the other hand, I have had good success with having it write simple programs and apps. So YMMV quite a lot more than with a regular person.
This generally means that for a task like the one you are doing, you need to have signposts in the data, like minute markers or something similar, that it can process serially.
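To make the signpost idea concrete, here is a rough sketch of what I mean, using my inventory case. The "bin <number>" marker format, the sample transcript, and the function name are all made up for illustration. The point is only that each chunk can then be handed to the model on its own, in order, instead of asking it to keep ~50 items straight at once.

```python
import re

# Assumption: every inventory bin is announced aloud as "bin <number>".
# This marker format is hypothetical; use whatever signpost your data has.
MARKER = re.compile(r"\bbin\s+(\d+)\b", re.IGNORECASE)

def split_on_signposts(transcript: str) -> list[tuple[int, str]]:
    """Split a transcript into (bin_number, text) chunks, one per marker."""
    matches = list(MARKER.finditer(transcript))
    chunks = []
    for i, m in enumerate(matches):
        # A chunk runs from the end of this marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        chunks.append((int(m.group(1)), transcript[m.end():end].strip()))
    return chunks

# Made-up sample input.
transcript = "bin 1 twelve resistors bin 2 forty capacitors bin 3 seven LEDs"
chunks = split_on_signposts(transcript)
for bin_number, text in chunks:
    print(bin_number, text)
```

Each (bin, text) pair is small enough to check by eye, which is most of the benefit.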
This means there are operations that are VERY HARD for the model, like ranking/sorting. Sorting requires the model to attend to everything to find the next biggest item, and so on. Current models struggle with it.
Comparison-based ranking / sorting needs on the order of n log n comparisons in the worst case, no matter what. Given that a transformer does a fixed amount of computation per forward pass before we 'force' it to output an answer, there must be an M such that beyond that length it cannot reliably sort a list. This MUST be the case and can only be solved by running the model some indeterminate number of times, but I don't believe we currently have any architecture to do that.
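A back-of-the-envelope sketch of why such an M has to exist. Any comparison sort needs at least ceil(log2(n!)) comparisons in the worst case, which grows like n log n. Treating a forward pass as a fixed "comparison budget" is my own simplification, not a real model of a transformer, and the function names are made up. The result is only an upper bound on M: lists longer than this are ruled out, while shorter ones are not guaranteed to work.

```python
import math

def min_worst_case_comparisons(n: int) -> int:
    """Information-theoretic lower bound ceil(log2(n!)) for comparison sorting."""
    # (x - 1).bit_length() equals ceil(log2(x)) for any integer x >= 1,
    # which avoids floating-point rounding on large factorials.
    return (math.factorial(n) - 1).bit_length()

def max_sortable_length(budget: int) -> int:
    """Largest n whose lower bound still fits in a fixed comparison budget.

    Assumption: 'budget' stands in for the fixed compute of one pass.
    """
    n = 1
    while min_worst_case_comparisons(n + 1) <= budget:
        n += 1
    return n

for budget in (10, 100, 1000):
    print(budget, max_sortable_length(budget))
```

However big the budget gets, the answer stays finite, so a long enough list always breaks a fixed-compute sorter.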
Note that humans have the same limitation. If you give humans a time limit, there is a maximum number of things they will be able to sort reliably in that time.