Just skimmed the GitHub repo for this one and the read me mentions four additional llm inferences for each incoming request - so now we’ve 5x’ed the (already expensive) compute required to answer a query?
I don’t know if I’m too stupid to understand or if truly this is just “add random stuff to prompt” dressed up in flowery academic language.
In other words, was an “unaligned” LLM taught bad things from bad people, or does it simply see it naturally and point it out with the purity of a child? The latter would mean something about ourselves. Personally I think that people tend to selectively ignore things too much.
And so it seems like people such as yourself who do have an issue with safeguards should seek out LLMs that are catered to adult audiences rather than trying to remove safeguards entirely.
There are numerous things that might be true, that may be damaging to a child's development to be exposed to. From overly punitive criticism to graphic depictions of violence, to advocacy and specific directions for self harm. Countless examples are trivial to generate.
Similarly, the use of these tools is already having dramatic effects on spearfishing, misinformation etc. Guardrails on all the non open-source models have enormous impact on slowing / limiting the damage this has at scale. Even with retrained Llama based models, it's more difficult than you might imagine to create a truly machiavellian or uncensored LLM - which is entirely due to the work that's been doing during and post training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.
Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.
There are actually training dataset full of bad thing by bad people, the intention is to use them negatively, as to teach the LLM some morality.
Anything involving the higher level abstractions (tensor flow / keras /whatever) is full of handwavy stuff about this or that activation function / number of layers / model architecture working the best and doing a trial error with a different component in the above if it doesn't. Closer to kids playing with legos than statistics.
I do not think it should be the default. I do not think that “adults” wanting “adult things” like some ideas on how to secure a computer system against social engineering should have to seek out some detuned “jailbroken” lower-quality model.
And I don’t think that assuming everyone is a child aligns with “human desires”, or should be couched in that language.
It should be possible to do with just one variant also, I think. The chat tuning pipeline could teach the model to censor itself if a given special token is present in the system message. The toggle changes between including that special token in the underlying system prompt of that chat session, or not. No idea if that's reliable or not, but in principle I don't see a reason why it shouldn't work.