I have been unable to recreate any of the failure examples they gave. I don't have Copilot, but Gemini 2.5 Pro, ChatGPT 5 Thinking, and Perplexity have all given the correct answers as outlined.[1]
They don't say what models they were actually using, though, so it could be nano models that they asked. They also don't outline the structure of the tests. The rigor here seems pretty low, which frankly comes off a bit like...misrepresentation.
Edit: They do some outlining in the appendix of the study. They used GPT-4o, 2.5 flash, default free copilot, and default free perplexity.
So they used lightweight and/or old models.
[1]https://www.bbc.co.uk/aboutthebbc/documents/news-integrity-i...