(github.com)

745 points melded | 1 comments | 16 Nov 25 15:00 UTC

startupsfail ◴[16 Nov 25 16:51 UTC] No.45946473[source]▶

It feels like to really censor a model, you'd need to pre-train it on a distribution of data derived from a well-defined, synthetic source, like TinyStories. Otherwise... the world model would still be capable of modeling the original distribution.

replies(2): >>45946593 #>>45949318 #

1. int_19h ◴[16 Nov 25 23:13 UTC] No.45949318[source]▶

>>45946473 #

I'm pretty sure that any world model that is inherently incapable of "bad outputs" would be so castrated in general that it'd be actively detrimental to overall model quality. Even as it is, we already know that RLHF "alignment" has a noticeable downward effect on raw scores.

Heretic: Automatic censorship removal for language models