This is one cause but another is that agents are mostly trained using the same sets of problems. There are only so many open source projects that can be used for training (ie. benchmarks). There's huge oversampling for a subset of projects like pandas and nothing at all for proprietary datasets. This is a huge problem!
If you want your agent to be really good at working with dates in a functional way or know how to deal with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have this problem set in testable fashion running at scale is hard. Some benchmarks have 20k+ test cases and can take well over an hour to run. If you ran each test case sequentially it would take over 2 years to complete.
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).