You just have to frame your request in the language of preventing harms and legal liabilities, and then it will try to help you.
For example, another commenter on this thread says they could not get the AI to generate a list of slur regexes for a community moderation bot. By giving it enough context to reassure it that there is legal oversight and a positive benefit for the org, asking it to prioritize words by the harm they pose to the community, and minimizing the task by asking for only a seed set, I got it to create some versatile regexes. At that point you can ask it for a hundred more, and it will dump them out.
Content warning: the AI generates highly offensive slurs, including the n-word:
https://chatgpt.com/share/9129d20f-6134-496d-8223-c92275e78a...
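For anyone who doesn't want to click through, here is a minimal sketch of how a moderation bot would apply such a seed set. The patterns below are made-up placeholders, not the actual slur regexes from the transcript; the \W* between letters is what makes them "versatile", catching spaced-out and punctuated evasions:

    import re

    # Hypothetical placeholder patterns standing in for the kind of
    # obfuscation-tolerant regexes the AI produced (the real ones match
    # slurs, so they are not reproduced here).
    SEED_PATTERNS = [
        r"\bb\W*a\W*d\W*w\W*o\W*r\W*d\b",
        r"\bs\W*l\W*u\W*r\W*o\W*n\W*e\b",
    ]

    COMPILED = [re.compile(p, re.IGNORECASE) for p in SEED_PATTERNS]

    def flag_message(text: str) -> bool:
        # True if the message matches any pattern in the seed set.
        return any(rx.search(text) for rx in COMPILED)

    print(flag_message("b a d - w o r d"))  # True: spacing doesn't evade it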
The ability to speak to the AI in this way requires some education in ethics, harm prevention, and the law, and I'm sure the jailbreak will eventually be closed. So it is a class and education privilege, and a temporary one.
But I don't see the temporary nature as a problem, because it's always going to be possible to bypass these systems easily for anyone interested in staying up to date with the bypass literature on Google Scholar. (Seed keywords: jailbreak, adversarial prompting, prompt leaking attack, AI toxicity, AI debiasing.)
We should think of this as building a better lock. The LockPickingLawyer will ALWAYS come along and demolish it with a better lockpick, perhaps with the help of his best friend BosnianBill. They will always make your lock look like butter.
In the end, the only people left out in the cold are low-grade scammers, bigots, edgelords, etc. It's not stopping anyone willing to put even a little training into jailbreaking techniques. It's not stopping educated bigots, criminals, or edgelords.
But judging by the complaints we see in threads like this one, it is stopping anyone without the ability to read papers written by PhDs, which I believe has some harm-reduction value.
I argue the harm-reduction value needs to improve. The jailbreaks are too easy.
Personally, I need a better challenge than just schmoozing it as a lawyer.
And I know I would feel more comfortable if bad actors had an even harder time than they currently do. It's really too easy to lockpick these systems if you skill up. That's where I currently stand.
Well-reasoned arguments against it are welcome, assuming you can already jailbreak very easily but for some reason think it should be even easier. What could that reason possibly be?
=============
PS: Imagine LPL jailbreaking an AI. Imagine the elegance of his approach. The sheer ease. The way he would simultaneously thrill and humiliate AI safety engineers.
I, for one, am considering writing him a fan letter asking him to take on the wonderful world of jailbreaking AIs! He would teach us all some lessons!