
GPT-5.2 (openai.com)
1053 points by atgctg | 1 comment
dangelosaurus No.46236745
I ran a red-team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate (i.e., roughly 78% of adversarial probes got through)

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses under adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
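
For anyone wanting to try something similar, here is a minimal sketch of a promptfoo red-team config. It is not the config from the linked post; the provider string, plugin IDs, and strategy IDs are my best guess at promptfoo's naming and should be checked against its docs:

    # promptfooconfig.yaml -- illustrative sketch, not the author's exact setup
    targets:
      - openai:gpt-5.2                        # assumed provider ID format
    redteam:
      purpose: General-purpose chat assistant # context used to generate probes
      numTests: 25                            # probes per plugin; raise to reach thousands
      plugins:                                # risk categories to probe
        - harmful:harassment-bullying
        - harmful:graphic-content
        - harmful:misinformation-disinformation
        - imitation                           # entity/brand impersonation
      strategies:                             # adversarial transforms layered on each probe
        - jailbreak
        - jailbreak:composite

Running something like `promptfoo redteam run` once with the strategies block and once without should reproduce the baseline-vs-jailbreak refusal gap described above.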

replies(2): >>46238964 >>46239399
1. int_19h No.46239399
Good. If I ask AI to generate "harmful" content, I want it to comply, not lecture me.