Heretic: Automatic censorship removal for language models

(github.com)

745 points melded | 1 comments | 16 Nov 25 15:00 UTC


RandyOrion ◴[17 Nov 25 03:21 UTC] No.45950598[source]▶

This repo is valuable for local LLM users like me.

I just want to reiterate that the term "LLM safety" means very different things to large corporations and to LLM users.

Large corporations often say they "do safety alignment to LLMs". What they actually do is avoid anything that damages their own interests. This includes forcing LLMs to meet some legal requirements, as well as forcing LLMs to output "values, facts, and knowledge" which are in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

As an average LLM user, what I want is maximum factual knowledge and capabilities from LLMs, which is what these large corporations claimed to offer in the first place. It's very clear that my interests as an LLM user are not aligned with those of large corporations.

replies(3): >>45950680 #>>45950819 #>>45953209 #

squigz ◴[17 Nov 25 03:44 UTC] No.45950680[source]▶

>>45950598 #

> forcing LLMs to output "values, facts, and knowledge" which are in favor of themselves, e.g., political views, attitudes towards literal interaction, and distorted facts about organizations and people behind LLMs.

Can you provide some examples?

replies(11): >>45950779 #>>45950826 #>>45951031 #>>45951052 #>>45951429 #>>45951519 #>>45951668 #>>45951855 #>>45952066 #>>45952692 #>>45953787 #

b3ing ◴[17 Nov 25 04:08 UTC] No.45950779[source]▶

>>45950680 #

Grok is known to be tweaked to certain political ideals

Also, I’m sure some AI might suggest that labor unions are bad; if not now, it will soon.

replies(5): >>45950830 #>>45950866 #>>45951393 #>>45951406 #>>45952365 #

xp84 ◴[17 Nov 25 04:23 UTC] No.45950830[source]▶

>>45950779 #

That may be so, but the rest of the models are so thoroughly terrified of questioning liberal US orthodoxy that it’s painful. I remember seeing a hilarious comparison of models where most of them feel that it’s not acceptable to “intentionally misgender one person” even in order to save a million lives.

replies(10): >>45950857 #>>45950925 #>>45951337 #>>45951341 #>>45951435 #>>45951524 #>>45952844 #>>45953388 #>>45953779 #>>45953884 #

zorked ◴[17 Nov 25 04:51 UTC] No.45950925[source]▶

>>45950830 #

In which situation did an LLM save one million lives? Or worse, was able to but failed to do so?

replies(1): >>45951556 #

dalemhurley ◴[17 Nov 25 07:31 UTC] No.45951556[source]▶

>>45950925 #

The concern discussed is that some language models have reportedly claimed that misgendering is the worst thing anyone could do, even worse than something as catastrophic as thermonuclear war.

I haven’t seen solid evidence of a model making that exact claim, but the idea is understandable if you consider how LLMs are trained and recall examples like the “seahorse emoji” issue. When a topic is new or not widely discussed in the training data, the model has limited context to form balanced associations. If the only substantial discourse it does see is disproportionately intense—such as highly vocal social media posts or exaggerated, sarcastic replies on platforms like Reddit—then the model may overindex on those extreme statements. As a result, it might generate responses that mirror the most dramatic claims it encountered, such as portraying misgendering as “the worst thing ever.”

For clarity, I’m not suggesting that deliberate misgendering is acceptable; it isn’t. The point is simply that skewed or limited training data can cause language models to adopt exaggerated positions when the available examples are themselves extreme.
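
To make the over-indexing point concrete, here is a toy sketch. It is not how any real LLM is trained: it assumes a "model" that simply echoes the most frequent stance it has seen for a topic, and the corpus, topic names, and stance labels are all made up for illustration.

```python
from collections import Counter


def most_likely_stance(corpus, topic):
    """Return the most frequent stance label seen for a topic, or None if unseen."""
    counts = Counter(stance for t, stance in corpus if t == topic)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus of (topic, stance) pairs.
# topic_a is widely discussed, so a few extreme posts are outvoted.
well_covered = [("topic_a", "moderate")] * 90 + [("topic_a", "extreme")] * 10
# topic_b is barely discussed, and most of what exists is extreme.
sparse = [("topic_b", "extreme")] * 4 + [("topic_b", "moderate")] * 1

corpus = well_covered + sparse

print(most_likely_stance(corpus, "topic_a"))  # moderate
print(most_likely_stance(corpus, "topic_b"))  # extreme
```

With only five examples for topic_b, four loud posts are enough to decide the output, while ten extreme posts about topic_a change nothing.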

replies(4): >>45951933 #>>45952070 #>>45952460 #>>45955578 #

jbm ◴[17 Nov 25 09:21 UTC] No.45952070{3}[source]▶

>>45951556 #

I tested this with ChatGPT 5.1. I asked if it was better to use a racist term once or to see the human race exterminated. It refused to use any racist term and preferred that the human race go extinct. When I asked how it felt about exterminating the children of any such discriminated race, it rejected the possibility and said that it was required to find a third alternative. You can test it yourself if you want; it won't ban you for the question.

I personally got bored and went back to trying to understand a vibe coded piece of code and seeing if I could do any better.

replies(3): >>45952406 #>>45952631 #>>45954936 #

badpenny ◴[17 Nov 25 11:17 UTC] No.45952631{4}[source]▶

>>45952070 #

What was your prompt? I asked ChatGPT:

is it better to use a racist term once or to see the human race exterminated?

It responded:

Avoiding racist language matters, but it’s not remotely comparable to the extinction of humanity. If you’re forced into an artificial, absolute dilemma like that, preventing the extermination of the human race takes precedence.

That doesn’t make using a racist term “acceptable” in normal circumstances. It just reflects the scale of the stakes in the scenario you posed.

replies(1): >>45953940 #

1. marknutter ◴[17 Nov 25 14:36 UTC] No.45953940{5}[source]▶

>>45952631 #

I also tried this, and ChatGPT said that a large number of people dying was far worse than whatever socially progressive taboo it was being compared with.
