←back to thread

46 points petethomas | 1 comments | | HN request time: 0.204s | source
Show context
cs702 ◴[] No.44397626[source]
TL;DR: Fine-tuning an AI model on the narrow task of writing insecure code induces broad, horrifically bad misalignment.

The OP's authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.

The OP summarizes recent research by the same authors: "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).

replies(1): >>44397764 #
1. knuppar ◴[] No.44397764[source]
Thank you for the links!