Even so... I still would be really surprised if there wasn't some systematic error here skewing the results, like the developers deliberately picked "easy" tasks that they already knew how to do, so implementing them themselves was particularly fast.
Seems like the authors had about as good a methodology as you can get for something like this. It's just really hard to test stuff like this. For example, I've seen studies proving that code comments don't matter... are you going to stop writing comments? No.