Could models mitigate this by answering questions incorrectly with random information instead of outright refusing to answer them?
replies(1):
But with some (unmodified) models I've tried (I don't remember the names, unfortunately), it definitely seemed like they weren't trained to outright refuse things but to answer poorly instead. So my impression is that this is indeed a strategy some model producers use?
(If anyone can debunk this, I'd be interested in hearing it; I'm only superficially familiar with the methods in use, and this is basically a guess at what would explain why those models behaved the way they did.)