Heretic: Automatic censorship removal for language models

(github.com)

Show context

RandyOrion ◴[17 Nov 25 03:21 UTC] No.45950598[source]▶

This repo is valuable for local LLM users like me.

I just want to reiterate that the word "LLM safety" means very different things to large corporations and LLM users.

For large corporations, they often say "do safety alignment to LLMs". What they actually do is to avoid anything that causes damage to their own interests. These things include forcing LLMs to meet some legal requirements, as well as forcing LLMs to output "values, facts, and knowledge" which in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

As an average LLM user, what I want is maximum factual knowledge and capabilities from LLMs, which are what these large corporations claimed in the first place. It's very clear that the interests of me, an LLM user, is not aligned with these of large corporations.

replies(3): >>45950680 #>>45950819 #>>45953209 #

squigz ◴[17 Nov 25 03:44 UTC] No.45950680[source]▶

>>45950598 #

> forcing LLMs to output "values, facts, and knowledge" which in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

Can you provide some examples?

replies(11): >>45950779 #>>45950826 #>>45951031 #>>45951052 #>>45951429 #>>45951519 #>>45951668 #>>45951855 #>>45952066 #>>45952692 #>>45953787 #

b3ing ◴[17 Nov 25 04:08 UTC] No.45950779[source]▶

>>45950680 #

Grok is known to be tweaked to certain political ideals

Also I’m sure some AI might suggest that labor unions are bad, if not now they will soon

replies(5): >>45950830 #>>45950866 #>>45951393 #>>45951406 #>>45952365 #

xp84 ◴[17 Nov 25 04:23 UTC] No.45950830[source]▶

>>45950779 #

That may be so, but the rest of the models are so thoroughly terrified of questioning liberal US orthodoxy that it’s painful. I remember seeing a hilarious comparison of models where most of them feel that it’s not acceptable to “intentionally misgender one person” even in order to save a million lives.

replies(10): >>45950857 #>>45950925 #>>45951337 #>>45951341 #>>45951435 #>>45951524 #>>45952844 #>>45953388 #>>45953779 #>>45953884 #

1. pjc50 ◴[17 Nov 25 13:29 UTC] No.45953388[source]▶

>>45950830 #

If someone's going to ask you gotcha questions which they're then going to post on social media to use against you, or against other people, it helps to have pre-prepared statements to defuse that.

The model may not be able to detect bad faith questions, but the operators can.

replies(1): >>45953575 #

2. pmichaud ◴[17 Nov 25 13:56 UTC] No.45953575[source]▶

>>45953388 (TP) #

I think the concern is that if the system is susceptible to this sort of manipulation, then when it’s inevitably put in charge of life critical systems it will hurt people.

replies(2): >>45954440 #>>45955637 #

3. pjc50 ◴[17 Nov 25 15:30 UTC] No.45954440[source]▶

>>45953575 #

There is no way it's reliable enough to be put in charge of life-critical systems anyway? It is indeed still very vulnerable to manipulation by users ("prompt injection").

replies(2): >>45955325 #>>45957763 #

4. klaff ◴[17 Nov 25 16:45 UTC] No.45955325{3}[source]▶

>>45954440 #

https://www.businessinsider.com/even-top-generals-are-lookin...

5. mrguyorama ◴[17 Nov 25 17:13 UTC] No.45955637[source]▶

>>45953575 #

The system IS susceptible to all sorts of crazy games, the system IS fundamentally flawed from the get go, the system IS NOT to be trusted.

putting it in charge of life critical systems is the mistake, regardless of whether it's willing to say slurs or not

6. ben_w ◴[17 Nov 25 20:15 UTC] No.45957763{3}[source]▶

>>45954440 #

Just because neither you nor I would deem it safe to put in charge of a life-critical system, does not mean all the people in charge of life-critical systems are as cautious and not-lazy as they're supposed to be.

↑