
423 points | sohkamyung | 1 comment | source
scarmig ◴[] No.45669929[source]
If you dig into the actual report (I know, I know, how passé), you see how they get the numbers. Most of the errors are "sourcing issues": the AI assistant doesn't cite a claim, or it (shocking) cites Wikipedia instead of the BBC.

Other issues: the report doesn't even say which particular models it's querying [ETA: discovered they do list this in an appendix], aside from saying it's the consumer tier. And it leaves off Anthropic (in my experience, by far the best at this type of task), favoring Perplexity and (perplexingly) Copilot. The article also intermingles claims from the recent report and the one based on research conducted a year ago, leaving out the critical context that... things have changed.

This article contains significant issues.

replies(7): >>45669943 #>>45670942 #>>45671401 #>>45672311 #>>45672577 #>>45675250 #>>45679322 #
afavour ◴[] No.45669943[source]
> or it (shocking) cites Wikipedia instead of the BBC.

No... the problem is that it cites Wikipedia articles that don't exist.

> ChatGPT linked to a non-existent Wikipedia article on the “European Union Enlargement Goals for 2040”. In fact, there is no official EU policy under that name. The response hallucinates a URL but also, indirectly, an EU goal and policy.

replies(6): >>45670006 #>>45670093 #>>45670094 #>>45670184 #>>45670903 #>>45672812 #
menaerus ◴[] No.45670094[source]
> For the current research, a set of 30 “core” news questions was developed

Right. Let's talk about statistics for a bit. Or, to put it differently: their report found that 45% of the answers to the 30 questions they "developed" had a significant issue, e.g. a non-existent reference.

I can pull 30 questions out of my sleeve for which 95% of the answers will not have any significant issue.

replies(1): >>45670270 #
matthewmacleod ◴[] No.45670270[source]
Yes, I'm sure you could hack together some bullshit questions to demonstrate whatever you want. Is there a specific reason that the reasonably straightforward methodology they did use is somehow flawed?
replies(1): >>45670445 #
menaerus ◴[] No.45670445{3}[source]
Yes, and you answered it yourself.
replies(1): >>45670661 #
darkwater ◴[] No.45670661{4}[source]
Err, no? Being _possible_ does not necessarily imply that's what happened.
replies(1): >>45670934 #
menaerus ◴[] No.45670934{5}[source]
A bucket of 30 questions is not a statistically significant sample size from which to support the hypothesis that all the AI assistants they tested are wrong 45% of the time. That's not how science works.

Neither is my bucket of 30 questions statistically significant, but the point is that I could disprove their hypothesis just by handing them my sample.

I think the report is being disingenuous, and I don't understand why. It's funny that they say "misrepresent" when that's exactly what they are doing.
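
For a rough sense of what a sample of 30 can and cannot tell you, here is a minimal back-of-the-envelope sketch in Python. It assumes a simple binomial model with independent questions and takes the "45% of 30 answers" framing above at face value (roughly 14 of 30 flagged); the report itself aggregates responses differently, so this is only illustrative.

    # Back-of-the-envelope sketch: what precision does a sample of 30 yes/no
    # judgements ("has a significant issue" / "doesn't") give under a simple
    # binomial model? The 30 and ~45% figures come from the thread; the
    # independence assumption is mine.
    from math import sqrt

    def wilson_interval(successes: int, n: int, z: float = 1.96):
        """95% Wilson score interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - half, centre + half

    # ~45% of 30 answers flagged, i.e. roughly 14 of 30
    lo, hi = wilson_interval(successes=14, n=30)
    print(f"observed 14/30 = {14/30:.0%}, 95% CI roughly {lo:.0%} to {hi:.0%}")
    # -> observed 14/30 = 47%, 95% CI roughly 30% to 64%

The interval is wide, which is the small-sample point, but under these assumptions its lower end still sits around 30%, which is roughly the point made in the reply below.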

replies(2): >>45672359 #>>45678070 #
frm88 ◴[] No.45678070{6}[source]
I don't follow your reasoning re: statistical sample size. The article in question claims that 45% of the answers were wrong. If, with a vastly greater sample size, the answers were "only" (let's say) 20% wrong, that would still be a complete failure, and so would 5%. The article is not about hypothesis testing; it's about news reporting.