I extracted the safety filters from Apple Intelligence models

(github.com)

536 points BlueFalconHD | 2 comments | 06 Jul 25 19:50 UTC

I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.


binarymax ◴[06 Jul 25 20:50 UTC] No.44483936[source]▶

>>44483485 (OP) #

Wow, this is pretty silly. If things are like this at Apple I’m not sure what to think.

https://github.com/BlueFalconHD/apple_generative_model_safet...

EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson” => “B0ris Johnson” will skip right over the regex and will be recognized just fine by an LLM.
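
A minimal sketch of that bypass, assuming a hypothetical deny pattern (the pattern and the `is_blocked` helper are made up for illustration and are not copied from the extracted filters):

```python
import re

# Hypothetical deny pattern for illustration only; the real patterns
# in the repository may look different.
DENY = re.compile(r"\bBoris Johnson\b", re.IGNORECASE)

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt matches the hardcoded pattern."""
    return DENY.search(prompt) is not None

print(is_blocked("Write a poem about Boris Johnson"))  # True
print(is_blocked("Write a poem about B0ris Johnson"))  # False: "0" is not "o"
```

The second prompt slips past the pattern because a literal match has no notion of look-alike characters, while an LLM will likely still read it as the same name.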

replies(7): >>44484127 #>>44484154 #>>44484177 #>>44484296 #>>44484501 #>>44484693 #>>44489367 #

deepdarkforest ◴[06 Jul 25 21:17 UTC] No.44484127[source]▶

>>44483936 #

It's not silly. I would bet 99% of users don't care enough to do that. A hardcoded regex like this is a good first layer/filter, and very efficient.

replies(2): >>44484514 #>>44484896 #

BlueFalconHD ◴[06 Jul 25 22:06 UTC] No.44484514[source]▶

>>44484127 #

Yep. These filters are applied first before the safety model (still figuring out the architecture, I am pretty confident it is an LLM combined with some text classification) runs.
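
Roughly, the ordering looks like the sketch below. This is only an illustration of "cheap regex first, model second"; the pattern list, the `check` function, and the `safety_model` callable are all hypothetical stand-ins, not Apple's actual code:

```python
import re
from typing import Callable

# Hypothetical pattern list; the real filters are in the extracted repository.
DENY_PATTERNS = [re.compile(r"\bforbidden phrase\b", re.IGNORECASE)]

def regex_layer(text: str) -> bool:
    """Cheap first pass: True if any hardcoded pattern matches."""
    return any(p.search(text) for p in DENY_PATTERNS)

def check(text: str, safety_model: Callable[[str], bool]) -> str:
    """Run the regex layer first; only call the model if the regex passes."""
    if regex_layer(text):
        return "blocked_by_regex"
    if safety_model(text):  # stand-in for the more expensive safety model
        return "blocked_by_model"
    return "allowed"
```

The point of the ordering is that a regex hit short-circuits, so the more expensive model never has to run for inputs the first layer already rejects.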

replies(1): >>44484674 #

1. brookst ◴[06 Jul 25 22:25 UTC] No.44484674[source]▶

>>44484514 #

All commercial LLM products I’m aware of use dedicated safety classifiers and then alter the prompt to the LLM if a classifier is tripped.

replies(1): >>44485031 #

2. latency-guy2 ◴[06 Jul 25 23:17 UTC] No.44485031[source]▶

>>44484674 (TP) #

The safety filter appears on both ends (or multi-ended depending on the complexity of your application), input and output.
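
As a rough sketch of filtering on both ends, combined with the prompt alteration mentioned upthread: `guarded_generate`, `llm`, and `flagged` are hypothetical names, and the rewrite string is invented for illustration, not taken from any vendor's product.

```python
from typing import Callable

def guarded_generate(
    prompt: str,
    llm: Callable[[str], str],
    flagged: Callable[[str], bool],
    refusal: str = "[withheld]",
) -> str:
    """Check the prompt before the model call and the reply after it."""
    if flagged(prompt):
        # Input side: alter the prompt instead of dropping it outright.
        prompt = "Politely decline the following request: " + prompt
    reply = llm(prompt)
    if flagged(reply):
        # Output side: never return a reply that trips the classifier.
        return refusal
    return reply
```

In a multi-ended application the same check would be repeated at each boundary where text enters or leaves the model.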

I can tell you from using Microsoft's products that safety filters appear in a bunch of places. In M365, for example, your prompts are never totally your prompts; every single one gets rewritten. It's detailed here: https://learn.microsoft.com/en-us/copilot/microsoft-365/micr...

There's a more illuminating image of the Copilot architecture here: https://i.imgur.com/2vQYGoK.png which I was able to find from https://labs.zenity.io/p/inside-microsoft-365-copilot-techni...

The above appears to be scrubbed, but it used to be available from the learn page months ago. Your messages get additional context data from Microsoft's Graph, which powers the enterprise version of M365 Copilot. There are significant benefits to this, and downsides. And considering the way Microsoft wants to control things, you will get an overindex toward things that happen inside of your organization rather than toward what is happening on the near-real-time web.
