One incorrect way to think of it is "LLMs will sometimes hallucinate when asked to produce content, but will provide grounded insights when merely asked to review/rate existing content".
A more productive (and secure) way to think of it is that every LLM is an "evil genie": an extremely smart, adversarial agent. If a PhD were being paid large sums of money to introduce errors into your work, could they still mislead you into thinking they performed the exact task you asked for?
Your prompt is ‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’:
- It is easy for the (compromised) reviewer to surface false positives: nitpick citations that are in fact correct by quoting irrelevant or made-up segments of the original research, leading you to believe a valid citation is wrong.
- It is easy for the (compromised) reviewer to surface false negatives: feed you cherry-picked or partial sentences from the source material to fabricate a conclusion the source never intended. (See the sketch below for what a mechanical check can and cannot catch.)
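
Quoted "evidence" from the reviewer should therefore be verified mechanically rather than taken on faith. A minimal sketch in Python, assuming you hold the source text locally; the function names are illustrative, not from any particular library:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences
    # don't mask an otherwise exact match.
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_appears_in_source(claimed_quote: str, source_text: str) -> bool:
    # True only if the reviewer's claimed excerpt occurs verbatim
    # (after normalization) in the real source. A fabricated excerpt
    # fails this check; a cherry-picked fragment can still pass.
    return normalize(claimed_quote) in normalize(source_text)

def quote_with_context(claimed_quote: str, source_text: str,
                       window: int = 300) -> str | None:
    # If the excerpt exists, return it plus surrounding text so a human
    # can judge whether it was quoted out of context.
    src = normalize(source_text)
    quote = normalize(claimed_quote)
    idx = src.find(quote)
    if idx == -1:
        return None  # not found verbatim: likely fabricated
    return src[max(0, idx - window): idx + len(quote) + window]
```

An exact-match failure flags a likely fabricated excerpt (the false-positive case); a match only proves the words exist, so the surrounding context still has to be read to rule out cherry-picking (the false-negative case).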
You do not solve the problem of unreliable actors by splitting them into two teams and having one unreliable actor review the other's work.
All of us have to contend with this nondeterministic behavior (I say this as someone who runs lots of LLM-based workloads in production) and assess when, in aggregate, the upside outweighs the costs.
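
One common way to make that assessment concrete is to repeat the same nondeterministic check and only act on verdicts with strong agreement. A minimal sketch; the `check` callable stands in for whatever LLM call you make, and the 80% threshold is an arbitrary placeholder:

```python
from collections import Counter
from typing import Callable, Optional

def majority_verdict(check: Callable[[], str], runs: int = 5,
                     threshold: float = 0.8) -> Optional[str]:
    # Run the same nondeterministic check several times and accept its
    # verdict only when a clear majority agrees; None means "no stable
    # consensus, escalate to a human".
    votes = Counter(check() for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict if count / runs >= threshold else None
```

Note that repeated sampling only smooths run-to-run variance; it does nothing about systematic error, which is exactly why one unreliable reviewer cannot audit another.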