Ironic, given that the LessWrong folks who presented this did so as part of their mission of motivating policymakers to ban open access to models. Hate their ideology, but love their research!
Edit: The data format is the same type used for DPO- or RLHF-style training: paired examples labeled “good” vs. “bad”, or “harmful” vs. “harmless”. What’s fun is to test the performance of this technique on your own datasets, to see how good the personalization is.
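
For anyone wanting to try it on their own data, here is a rough sketch of what that paired layout typically looks like. The field names (“prompt”, “chosen”, “rejected”) follow the common convention used by DPO-style trainers such as Hugging Face TRL; your tooling may expect different names, and the example pairs and file name are just placeholders.

    # Minimal sketch of a paired-preference dataset in JSONL form,
    # the same general shape DPO/RLHF-style trainers consume.
    import json

    pairs = [
        {
            "prompt": "How do I reset my router?",
            # the "good"/"harmless" response you want to reinforce
            "chosen": "Hold the reset button for about 10 seconds, then reconfigure it.",
            # the "bad"/"harmful" (or simply unwanted) response you want to steer away from
            "rejected": "I can't help with questions about networking hardware.",
        },
        # ...add your own pairs here to test how well the personalization works
    ]

    # Write one JSON object per line (JSONL), a format most trainers can ingest.
    with open("my_preference_pairs.jsonl", "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")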