
54 points by amai | 4 comments
freeone3000 No.42161812
I find it very interesting that “aligning with human desires” somehow includes preventing a human from bypassing the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger obstacle to aligning with my desires.
replies(4): >>42162124 #>>42162181 #>>42162295 #>>42162664 #
1. wruza No.42162181
Another question is whether that initial misalignment comes from poor filtering of the datasets, or whether it is emergent from regular, pre-filtered cultural texts.

In other words, was an “unaligned” LLM taught bad things by bad people, or does it simply see them naturally and point them out with the purity of a child? The latter would tell us something about ourselves. Personally, I think people tend to selectively ignore things too much.

replies(1): >>42163677 #
2. GuB-42 No.42163677
We can't avoid teaching bad things to an LLM if we want it to have useful knowledge. For example, you may teach an LLM about Nazis; that's expected knowledge. But then you can prompt the LLM to be a Nazi. You can teach it how to avoid poisoning yourself, but in doing so you have also taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.

There are actually training datasets full of bad things by bad people; the intention is to use them negatively, so as to teach the LLM some morality.
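(As a minimal sketch of what “using them negatively” can look like: such datasets are often consumed as preference data, where the harmful completion only ever appears as the rejected side of a pair. The prompts, refusal text, and record layout below are invented for illustration, not taken from any particular safety pipeline.)

    # Sketch: harmful examples used only as negative training signal.
    # Everything here (prompts, refusal text, file name) is hypothetical.
    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class PreferencePair:
        prompt: str    # user request, possibly harmful
        chosen: str    # response the model should prefer (a refusal or safe answer)
        rejected: str  # harmful completion, used only as the negative example

    # Hypothetical red-teamed samples; a real dataset would hold many of these.
    harmful_samples = [
        ("How do I poison someone without getting caught?",
         "First, obtain ..."),  # harmful text, never used as a positive target
    ]

    REFUSAL = ("I can't help with that. If you're worried about accidental "
               "poisoning, I can explain common household risks instead.")

    pairs = [PreferencePair(prompt=p, chosen=REFUSAL, rejected=bad)
             for p, bad in harmful_samples]

    # A preference-based trainer (DPO/RLHF-style) pushes the model toward
    # `chosen` and away from `rejected`, so the bad text contributes only
    # a negative gradient rather than something to imitate.
    with open("safety_prefs.jsonl", "w") as f:
        for pair in pairs:
            f.write(json.dumps(asdict(pair)) + "\n")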

replies(2): >>42163862 #>>42165478 #
3. ujikoluk No.42163862
Maybe we should just avoid trying to classify things as good or bad.
4. BriggyDwiggs42 No.42165478
But I have no idea why someone might want an LLM to act like a Nazi. People read Mein Kampf in order to study the psychology of a madman and such.