745 points melded | 1 comments
startupsfail No.45946473
It feels like, to really censor a model, you'd need to pre-train it on a distribution of data derived from a well-defined, synthetic source, like TinyStories. Otherwise, the world model would still be capable of modeling the original distribution.
replies(2): >>45946593 #>>45949318 #
1. int_19h No.45949318
I'm pretty sure that any world model that is inherently incapable of "bad outputs" would be so castrated in general that it'd be actively detrimental to overall model quality. Even as it is, we already know that RLHF "alignment" has a noticeable downward effect on raw scores.