    54 points amai | 16 comments
    1. freeone3000 ◴[] No.42161812[source]
    I find it very interesting that “aligning with human desires” somehow includes prevention of a human trying to bypass the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem with aligning with my desires.
    replies(4): >>42162124 #>>42162181 #>>42162295 #>>42162664 #
    2. ipython ◴[] No.42162124[source]
    We’ve seen where that ends up. https://en.m.wikipedia.org/wiki/Tay_(chatbot)
    3. wruza ◴[] No.42162181[source]
    Another question is whether that initial unalignment comes from poor filtering of datasets, or is it emergent from regular, pre-filtered cultured texts.

    In other words, was an “unaligned” LLM taught bad things from bad people, or does it simply see it naturally and point it out with the purity of a child? The latter would mean something about ourselves. Personally I think that people tend to selectively ignore things too much.

    replies(1): >>42163677 #
    4. threeseed ◴[] No.42162295[source]
    The safeguards stem from a desire to make tools like Claude accessible to a very wide audience, as use cases such as education are very important.

    And so it seems like people such as yourself who do have an issue with safeguards should seek out LLMs that are catered to adult audiences rather than trying to remove safeguards entirely.

    replies(3): >>42162675 #>>42163652 #>>42165642 #
    5. Zambyte ◴[] No.42162664[source]
    What tools do we have to defend against LLM lockdown attacks?
    6. Zambyte ◴[] No.42162675[source]
    How does making it harder for the user to extract information they are trying to extract make it safer for a wider audience?
    replies(2): >>42162977 #>>42163153 #
    7. dbspin ◴[] No.42162977{3}[source]
    Assuming that this question is good faith...

    There are numerous things that might be true, that may be damaging to a child's development to be exposed to. From overly punitive criticism to graphic depictions of violence, to advocacy and specific directions for self harm. Countless examples are trivial to generate.

    Similarly, the use of these tools is already having dramatic effects on spear phishing, misinformation, etc. Guardrails on all the non-open-source models have an enormous impact on slowing / limiting the damage this has at scale. Even with retrained Llama-based models, it's more difficult than you might imagine to create a truly machiavellian or uncensored LLM, which is entirely due to the work that's been done during and post training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.

    8. Drakim ◴[] No.42163153{3}[source]
    That's like asking why we should have porn filters on school computers, after all, all it does is prevent the user from finding what they are looking for, which is bad.
    replies(1): >>42172066 #
    9. selfhoster11 ◴[] No.42163652[source]
    Here is a revolutionary concept: give the users a toggle.

    Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.

    replies(1): >>42166788 #
    10. GuB-42 ◴[] No.42163677[source]
    We can't avoid teaching bad things to an LLM if we want it to have useful knowledge. For example, you may teach an LLM about nazis; that's expected knowledge. But then you can prompt an LLM to be a nazi. You can teach it how to avoid poisoning yourself, but then you have taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.

    There are actually training datasets full of bad things by bad people; the intention is to use them as negative examples, to teach the LLM some morality.

    replies(2): >>42163862 #>>42165478 #
    11. ujikoluk ◴[] No.42163862{3}[source]
    Maybe we should just avoid trying to classify things as good or bad.
    12. BriggyDwiggs42 ◴[] No.42165478{3}[source]
    But I have no idea why someone might want an LLM to act like a nazi. People read Mein Kampf in order to study the psychology of a madman and such.
    13. freeone3000 ◴[] No.42165642[source]
    If you are making an LLM for children, I have no problem with that! I’m not sure kids being completely removed from the adult world until suddenly being dumped into it is a great way to build an integrated society, but sure, you do you. Build your LLM with safeguards for educational use, best of luck to you!

    I do not think it should be the default. I do not think that “adults” wanting “adult things” like some ideas on how to secure a computer system against social engineering should have to seek out some detuned “jailbroken” lower-quality model.

    And I don’t think that assuming everyone is a child aligns with “human desires”, or should be couched in that language.

    14. threeseed ◴[] No.42166788{3}[source]
    Not sure if you understand how LLMs work.

    But the guard rails are intrinsic to the model itself. You can't just have a toggle.

    replies(1): >>42171141 #
    15. selfhoster11 ◴[] No.42171141{4}[source]
    Yes, you very much can. One very simple way to do so is to have two variants deployed: the censored one, and the uncensored one. The switch simply changes between which of the two you are using. You have to juggle two variants now across your inference infrastructure, but I expect OpenAI to be able to deal with this already due to A/B testing requirements. And it's not like these companies don't have internal-only uncensored versions of these models for red teaming etc, so you aren't spending money building something new.

    It should be possible to do with just one variant also, I think. The chat tuning pipeline could teach the model to censor itself if a given special token is present in the system message. The toggle changes between including that special token in the underlying system prompt of that chat session, or not. No idea if that's reliable or not, but in principle I don't see a reason why it shouldn't work.
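
    A rough sketch of both ideas, purely illustrative: the variant names, the `<|safe_mode|>` token, and the `Account` type are all made up here and do not correspond to any real provider's API. It only shows where the toggle would sit (an organisation policy overrides the user's choice), not whether a model tuned this way would behave reliably.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical control token; a real one would have to be introduced during chat tuning.
SAFE_MODE_TOKEN = "<|safe_mode|>"


@dataclass(frozen=True)
class Account:
    user_wants_safeguards: bool
    # None means a personal account; True/False means the IT department decides.
    org_policy: Optional[bool] = None


def safeguards_enabled(account: Account) -> bool:
    """An organisation-level setting, if present, overrides the user's toggle."""
    if account.org_policy is not None:
        return account.org_policy
    return account.user_wants_safeguards


def pick_variant(account: Account) -> str:
    """Approach 1: route the request to one of two deployed model variants."""
    return "model-filtered" if safeguards_enabled(account) else "model-unfiltered"


def build_system_prompt(account: Account, base_prompt: str) -> str:
    """Approach 2: one model; prepend the control token when safeguards are on."""
    if safeguards_enabled(account):
        return f"{SAFE_MODE_TOKEN}\n{base_prompt}"
    return base_prompt
```

    With this, a personal account with the toggle off gets `model-unfiltered` (or a prompt without the token), while the same user on an organisation account with safeguards enforced gets the filtered behaviour regardless of their own setting.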

    16. ◴[] No.42172066{4}[source]