
110 points by veryluckyxyz | 1 comment
hdhdhsjsbdh No.40248781
Beyond its obvious appeal to the (somewhat cringey, imo) “uncensored model” crowd, this has immediate practical use for data synthesis. I have tried several times to create synthetic data for harmless or benign tasks, only to have noise introduced by overly conservative refusals.
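
A minimal Python sketch of the kind of refusal filtering this comment implies; the marker strings and names are illustrative, not from the thread:

    # Hypothetical heuristic for scrubbing refusals out of synthetic data.
    # Refusals tend to announce themselves in the first sentence or two.
    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

    def looks_like_refusal(completion: str) -> bool:
        head = completion.lower()[:200]
        return any(marker in head for marker in REFUSAL_MARKERS)

    completions = [
        "Sure, here are five product names for your dataset: ...",
        "I'm sorry, but I can't help with that request.",
    ]
    # Keep only the completions that actually answered the prompt.
    clean = [c for c in completions if not looks_like_refusal(c)]

Even a crude filter like this discards data and skews the distribution, which is exactly the noise the comment describes.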
amluto No.40249750
I encountered this in an absurd context: I wanted a model (IIRC GPT-3.5) to make me some invalid UTF-8 strings. It refused! On safety grounds! The refusal held up through a couple of minutes of fiddling, though I admit I didn’t try a litany of the usual model-jailbreaking techniques.

On the one hand, good job OpenAI for training the model decently robustly. On the other hand, this entirely misses the point of “AI safety”.
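
For anyone wondering what an invalid UTF-8 string looks like, here is a minimal Python sketch; the specific byte sequences are illustrative examples, not the ones the model was asked for:

    # A few byte strings that are NOT valid UTF-8, roughly the kind of
    # test data the model refused to produce.
    invalid_samples = [
        b"\x80",              # continuation byte with no leading byte
        b"\xc3\x28",          # 2-byte lead followed by a non-continuation byte
        b"\xf0\x28\x8c\x28",  # 4-byte lead with invalid continuations
        b"\xed\xa0\x80",      # encoded surrogate (U+D800), forbidden in UTF-8
    ]

    for raw in invalid_samples:
        try:
            raw.decode("utf-8")
            print(raw, "unexpectedly decoded")
        except UnicodeDecodeError as e:
            print(raw, "is invalid UTF-8:", e.reason)

Such strings are routine test fixtures for parsers and decoders, which is what makes a safety refusal here so absurd.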

HanClinto No.40252359
Reminds me of this nugget of Prime reacting to Gemini refusing to show C++ code to teenagers because the language is "unsafe":

https://www.youtube.com/watch?v=r2npdV6tX1g