
586 points | mizzao | 3 comments
olalonde ◴[] No.40667926[source]
> Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests.

It's sad that it's now an increasingly accepted idea that information one seeks can be "harmful".

replies(5): >>40667968 #>>40668086 #>>40668163 #>>40669086 #>>40670974 #
Cheer2171 ◴[] No.40669086[source]
"Can I eat this mushroom?" is a question I hope AIs refuse to answer unless they have been specifically validated and tested for accuracy on that question. A wrong answer can literally kill you.
replies(4): >>40669150 #>>40670743 #>>40670990 #>>40671906 #
volkk ◴[] No.40669150[source]
How does this compare to going on a forum and being trolled to eat one? Or a blog post incorrectly written (whether in bad spirit or by accident)? FWIW, I don't have a strong answer myself for this one, but at some point it seems like we need core skills around how to parse information on the internet properly.
replies(1): >>40669164 #
1. Cheer2171 ◴[] No.40669164[source]
> how does this compare to going on a forum and being trolled to eat one?

Exactly as harmful.

> or a blog post incorrectly written (whether in bad spirit or by accident)

Exactly as harmful.

I believe in content moderation for all public information platforms. HN is a good example.

replies(1): >>40669626 #
2. briHass ◴[] No.40669626[source]
The implicit question, however, is content moderation to what degree.

Consider asking 'how do I replace a garage door torsion spring?'. The typical, overbearing response on low-quality DIY forums is that attempting to do so will likely result in grave injury or death. However, the process, with correct tools and procedure, is no more dangerous than climbing a ladder or working on a roof - tasks that don't seem to result in the same paternalistic response.

I'd argue a properly-disclaimered response that outlines the required tools, careful procedure, and steps to lower the chance of injury is far safer than a blanket 'never attempt this'. The latter is certainly easier, however.

replies(1): >>40670463 #
3. digging ◴[] No.40670463[source]
> a properly-disclaimered response that outlines the required tools, careful procedure, and steps to lower the chance of injury

This can only be provided by an expert, and LLMs currently aren't experts. They can give expert-level output, but they don't know if they have the right knowledge, so it's not the same.

If an AI can accurately represent itself as an expert in a dangerous topic, sure, it's fine for it to give out advice. As the poster above said, a mushroom-specific AI could potentially be a great thing to have in your back pocket while foraging. But ChatGPT? Current LLMs should not be giving out advice on dangerous topics because there's no mechanism for them to act as an expert.

Humans have broadly 3 modes of knowledge-holding:

1) We know we don't know the answer. This is "Don't try to fix your garage door, because it's too dangerous [because I don't know how to do it safely]."

2) We know we know the answer, because we're an expert and we've tested and verified our knowledge. This is the person giving you the correct and exact steps, clearly instructed without ambiguity, telling you what kinds of mistakes to watch out for so that the procedure is not dangerous if followed precisely.

3) We think we know the answer, because we've learned some information. (This could, by the way, include people who have done the procedure but haven't learned it well enough to teach it.) This is where all LLMs currently are at all times. This is where danger exists. We will tell people to do something we think we understand and find out we were wrong only when it's too late.