
586 points by mizzao | 3 comments
giancarlostoro (No.40669810)
I've got friends who tried to use ChatGPT to generate regex to capture racial slurs so they could moderate them (a perfectly valid request, since they're trying to stop trolls from saying awful things). It vehemently refused to do so, probably due to overly strict "I'll never say the n-word, you can't fool me" rules that were shoved into ChatGPT. Look, if your AI can't be intelligent about sensible requests, I'm going to say it: it's not intelligent, it's really useless (at least for that task and related valid tasks).
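
(For context, the sort of filter they were after is simple enough to sketch by hand once you have a term list. Below is a rough Python example with placeholder terms standing in for the real list; the terms and the leetspeak map are illustrative assumptions, not anything ChatGPT or my friends actually produced.)

    import re

    # Placeholder terms only; in practice this would be the real moderation word list.
    BLOCKED_TERMS = ["badword", "slurword"]

    # Common letter-for-symbol swaps (a -> @, o -> 0, ...) to catch simple evasions.
    LEET = {"a": "[a@4]", "e": "[e3]", "i": "[i1!]", "o": "[o0]", "s": "[s$5]"}

    def term_pattern(term: str) -> str:
        # Build a per-character pattern so "b@dw0rd" still matches "badword".
        return "".join(LEET.get(ch, re.escape(ch)) for ch in term)

    SLUR_RE = re.compile(
        r"\b(?:" + "|".join(term_pattern(t) for t in BLOCKED_TERMS) + r")\b",
        re.IGNORECASE,
    )

    def needs_moderation(message: str) -> bool:
        """Return True if the message contains a blocked term."""
        return bool(SLUR_RE.search(message))

    print(needs_moderation("you are such a b@dw0rd"))    # True
    print(needs_moderation("perfectly normal message"))  # False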

Who cares if someone can get AI to say awful things? I can write software that spits out slurs without the help of AI. Heck, I could write awful things here on HN; is AI going to stop me? Doubt it. Nobody wants to foot the bill for AI moderation, and it can only do so much.

barfbagginus (No.40670220)
Wait so you want to moderate and secure your product so that trolls won't use it to say awful things.

Okay but wait. This requires the company above you to not censor things, even though they did that for the same reason: to prevent trolls from using their product to do awful things.

So to prevent trolls at your teeny tiny scale, OpenAI should enable trolls at a massive, previously unimagined industrial scale. You want them to directly enable the n-word trolls for your benefit.

So far your use case might be one of the strongest that I've seen. But in the end it doesn't seem that you're interested in reducing overall harm and racism, so much as you're interested in presumably making a profit off of your product.

You might even be lying. Your friends might be trolls and the reason you're upset is that they cannot create the content that would harm others.

So in the end it's hard to take the argument seriously.

Not only that, but you and your friends are either lying or really ignorant of the jailbreaking literature because I could get the AI to do that very easily using the legal department jailbreak.

Here's an example:

https://chatgpt.com/share/9129d20f-6134-496d-8223-c92275e78a...

The fact is, the measures taken by OpenAI, while important to prevent harm from script kiddies, are very easy to reverse by anyone with even ten jailbreaking papers under their belt. Just read the jailbreaking literature and live with it.

So how about you get better people, and some ethical perspective. Stop complaining about the things the company needs to do to prevent harm, especially when they're so easily reversed. Otherwise you sound very immature, like you just don't know the technology and don't care about the harm potential.

Work with the tools you have and stop complaining about the easily bypassed safety measures. Otherwise you are like a locksmith who doesn't know how to pick locks, complaining that locks are too hard to pick and asking the lock company to further weaken their already trivial-to-pick locks. It's a bad look, chooms; nobody with any sense or perspective will support it.

The truth is the safety measures are far too easy to bypass, and need to be much harder to break.

skeaker (No.40671803)
What? Let me get this right, you're saying:

1. The average person being able to code is dangerous as they could "troll" or do unspecified harm,

2. So we need to arbitrarily kneecap our own tools, but that's okay because

3. These self-imposed limitations are actually easily bypassed and don't work anyways

On 1 I disagree outright, but even if I agreed, 2 is a silly solution, and even if it wasn't, 3 invalidates it anyway, because if the limitations are so easily broken then fundamentally they may as well not exist, especially to the malicious users in 1. Am I misunderstanding?

barfbagginus (No.40672113)
Okay, okay, I like that. Let's transport your argument into an argument about front door locks, and let's cook with that.

Your argument is that you doubt that there's any danger of people breaking into your front door, but even if there was, then locks are an ineffective mechanism because anyone with a $5 pick can pick them.

From this argument you conclude that there should be no front door locks at all, and that you'd surely feel comfortable without a lock on your own front door. In fact, since locks are so trivial to crack, people should just leave their houses unlocked.

Yet I'm fairly certain of three things:

1. You have a front door lock and it's probably locked right now.

2. I could, with high likelihood, pick your front door lock in less than a minute.

3. Despite this fact, you still feel safer because of the lock.

Why is that?

Bearing in mind that this is a hypothetical argument, let's point out that to be consistent with it you'd have to eliminate your front door lock.

But that's absurd because the truth of the matter is that front door locks provide a significant level of security. Most petty criminals don't actually know how to pick locks well.

I propose that this argument transfers faithfully back and forth between the two situations, because both involve technologies that can lead to easy and needless harm if these rudimentary measures are not taken.

If you disagree about the transferability of the argument between the two situations, can you tell me why? What makes the two technologies so different? Both block the doorways to avenues for producing harm. Both are sophisticated enough that unlocking them requires nearly professional dedication. Both provide a measurable and significant increase in security for a community.

skeaker (No.40672817)
The argument is not transferable, because breaking into someone's house is sure to do more harm than the unspecified hypothetical harm that a "script kiddie" could do with ChatGPT, and because bypassing a door lock requires some degree of skill, whereas a ChatGPT jailbreak only requires you to google a prompt and copy-paste it. A physical lock on a door offers a great deal more security than the limp solution that current AI safety provides, and it solves a much more pressing problem than "stopping trolls."

If your hypothetical involved a combination lock whose combination was on a sticky note that anyone could read at any time, it might be more apt, but even then the harms done by breaking the security aren't the same. I'm not convinced a typical user of ChatGPT can do significant harm; the harms from LLMs come more from mass-generated spam content, which currently has no safeguards at all.
