
534 points | BlueFalconHD | 1 comment

I managed to reverse engineer the encryption (referred to as "Obfuscation" in the framework) responsible for managing the safety filters of the Apple Intelligence models, and I have extracted the filters into a repository. I encourage you to take a look around.
mike_hearn No.44483836
Are you sure it's fully deobfuscated? What's up with reject phrases like "Granular mango serpent"?
1. pbhjpbhj No.44484047
Speculation: maybe they know that the real phrase sits close enough to "granular mango serpent" in the embedding space to be treated as synonymous with it. The phrase then acts like a nickname whose expected inference only the model's authors know.

Thus a pre-prompt can avoid mentioning the actual forbidden words, much like a patois or cant.
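To make the speculation concrete: if a filter blocked on embedding proximity rather than exact string match, any phrase whose vector lies near the blocked concept would trip it, so a decoy phrase could stand in for the real one. A minimal sketch, assuming made-up toy embeddings and an assumed similarity threshold (nothing here is Apple's actual mechanism):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings: the decoy sits near the (unknown) real
# forbidden phrase, while an unrelated phrase sits far away.
embeddings = {
    "granular mango serpent":  [0.90, 0.10, 0.20],
    "<real forbidden phrase>": [0.88, 0.12, 0.18],
    "weather forecast":        [0.10, 0.90, 0.30],
}

THRESHOLD = 0.95  # assumed similarity cutoff for blocking

decoy = embeddings["granular mango serpent"]
for phrase, vec in embeddings.items():
    sim = cosine(decoy, vec)
    print(f"{phrase!r}: similarity={sim:.3f} blocked={sim >= THRESHOLD}")
```

Under these toy values the decoy and the placeholder "real" phrase score above the cutoff while the unrelated phrase does not, which is the behaviour the comment is guessing at.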