
534 points | BlueFalconHD

I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models, and I have extracted the filters into a repository. I encourage you to take a look around.
binarymax ◴[] No.44483936[source]
Wow, this is pretty silly. If things are like this at Apple, I’m not sure what to think.

https://github.com/BlueFalconHD/apple_generative_model_safet...

EDIT: just to be clear, filters like this are easily bypassed. “Boris Johnson” => “B0ris Johnson” will skip right past the regex and still be recognized just fine by an LLM.
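
To make the failure concrete, here is a minimal Python sketch of a literal regex deny-list and the one-character bypass. The pattern and helper are invented for illustration, not Apple’s actual filter:

    import re

    # Hypothetical deny-list entry, compiled the way a naive filter might.
    BLOCKED = re.compile(r"\bBoris Johnson\b", re.IGNORECASE)

    def passes_filter(text: str) -> bool:
        """Return True if the text clears the deny-list."""
        return BLOCKED.search(text) is None

    print(passes_filter("Boris Johnson"))   # False -- caught by the regex
    print(passes_filter("B0ris Johnson"))   # True  -- sails right past it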

replies(7): >>44484127 #>>44484154 #>>44484177 #>>44484296 #>>44484501 #>>44484693 #>>44489367 #
tpmoney ◴[] No.44484177[source]
I doubt the purpose here is to prevent someone from intentionally sidestepping the block. More likely it exists to avoid the sort of headlines you’d expect if someone was suggested “I wish ${politician} would die” as a reply to an email mentioning that politician. In general, you should view these broad word filters as short-circuiting the “think of the children” reaction to Tiny Tim’s phone suggesting not that God should “bless us, every one”, but that God should “kill us, every one”. A dumb filter like this is more than enough for that sort of thing.
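
As a rough illustration of that short circuit (a Python sketch with an invented term list, not the real pipeline), the filter only has to screen suggestions before they reach the UI:

    # Screen model-generated reply suggestions against a crude term list
    # before surfacing them; an unlucky completion simply never appears.
    DENY_TERMS = {"die", "kill"}

    def safe_suggestions(suggestions: list[str]) -> list[str]:
        return [s for s in suggestions
                if not any(term in s.lower() for term in DENY_TERMS)]

    print(safe_suggestions([
        "God bless us, every one",
        "God kill us, every one",   # silently dropped, no headline
    ]))

The substring matching is deliberately crude; for this purpose a few false positives are cheaper than one bad screenshot.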
replies(1): >>44484332 #
XorNot ◴[] No.44484332[source]
It would also substantially disrupt the generation process: a model which sees “B0ris” rather than “Boris” will struggle to associate the input with the politician, since the altered spelling won’t be well represented in the training set. The same applies on the output side: if the model does make the association, a reasoning model, for example, would emit the proper name first, at which point the supervisor process can reject the output.
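
A sketch of that output-side check in Python (the supervisor shape and names are assumptions, not the actual framework): even if the model maps “B0ris” back to the real name, the raw output trips the filter before anything is shown:

    import re

    BLOCKED_NAME = re.compile(r"\bBoris Johnson\b", re.IGNORECASE)

    def supervise(generated_text: str) -> str:
        """Reject any generation that restores the filtered name."""
        if BLOCKED_NAME.search(generated_text):
            raise ValueError("blocked: output mentions a filtered name")
        return generated_text

    # The obfuscated prompt slipped past the input filter, but the
    # model's own output restores the canonical spelling.
    try:
        supervise("You mean Boris Johnson, the former UK PM?")
    except ValueError as err:
        print(err)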
replies(3): >>44484499 #>>44484952 #>>44485371 #
binarymax ◴[] No.44485371[source]
No, it doesn’t disrupt anything; handling typos like this is a well-known capability of LLMs. Most models don’t even point out the mistake, they just carry on:

https://chatgpt.com/share/686b1092-4974-8010-9c33-86036c88e7...
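
The same check is easy to reproduce against any chat API. A hedged sketch using the OpenAI Python client (the model choice and prompt are arbitrary, and an API key is assumed):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any recent chat model will do
        messages=[{"role": "user", "content": "Who is B0ris Johnson?"}],
    )
    # The reply discusses Boris Johnson as if the prompt were spelled
    # correctly, typically without remarking on the "0".
    print(resp.choices[0].message.content)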