I extracted the safety filters from Apple Intelligence models

(github.com)

534 points BlueFalconHD | 1 comments | 06 Jul 25 19:50 UTC | HN request time: 0s | source

I managed to reverse engineer the encryption (refered to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.

Show context

trebligdivad ◴[06 Jul 25 20:56 UTC] No.44483981[source]▶

>>44483485 (OP) #

Some of the combinations are a bit weird, This one has lots of stuff avoiding death....together with a set ensuring all the Apple brands have the correct capitalisation. Priorities hey!

https://github.com/BlueFalconHD/apple_generative_model_safet...

replies(11): >>44483999 #>>44484073 #>>44484095 #>>44484410 #>>44484636 #>>44486072 #>>44487916 #>>44488185 #>>44488279 #>>44488362 #>>44488856 #

grues-dinner ◴[06 Jul 25 21:09 UTC] No.44484073[source]▶

>>44483981 #

Interesting that it didn't seem to include "unalive".

Which as a phenomenon is so very telling that no one actually cares what people are really saying. Everyone, including the platforms knows what that means. It's all performative.

replies(11): >>44484164 #>>44484360 #>>44484635 #>>44484665 #>>44485033 #>>44485034 #>>44486246 #>>44487244 #>>44488055 #>>44488114 #>>44500918 #

qingcharles ◴[06 Jul 25 21:22 UTC] No.44484164[source]▶

>>44484073 #

It's totally performative. There's no way to stay ahead of the new language that people create.

At what point do the new words become the actual words? Are there many instances of people using unalive IRL?

replies(17): >>44484171 #>>44484218 #>>44484614 #>>44484958 #>>44484970 #>>44484989 #>>44485202 #>>44485277 #>>44485309 #>>44486128 #>>44486394 #>>44487625 #>>44487839 #>>44487936 #>>44488097 #>>44488704 #>>44493436 #

1. jama211 ◴[07 Jul 25 18:44 UTC] No.44493436[source]▶

>>44484164 #

Reducing the language used or making it harder does have measurable effects, it’s a logical fallacy in general that unless you can prevent something perfectly that thing will occur with the same frequency.

See many examples such as “padlocks are useless because a determined smart attacker can defeat them easily so don’t bother with them” - which conveniently forgets that many crimes are committed by non-determined, dumb and opportunistic attackers who are often deterred by simple locks.

Yes, people will use other words. No, this does not make this purely performative. It has measurable effects on behaviour and how these models will be used and spoken to, which affects outcomes.

↑