Could models mitigate this by answering questions incorrectly with random information instead of outright refusing to answer them?
replies(1):
But with some (unmodified) models I've tried (I don't remember the names, unfortunately), it definitely seemed like they weren't trained to outright refuse things but to answer poorly instead. So my impression is that this is indeed a strategy some model producers use?
(If anyone can debunk this, I'd be interested in hearing it; I'm only superficially familiar with the methods in use, and this is basically a guess at what would explain why those models behaved the way they did.)