I extracted the safety filters from Apple Intelligence models

(github.com)

536 points BlueFalconHD | 1 comments | 06 Jul 25 19:50 UTC

I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.

torginus ◴[06 Jul 25 21:31 UTC] No.44484236[source]▶

>>44483485 (OP) #

I find it funny that AGI is supposed to be right around the corner, while these supposedly super smart LLMs still need to get their outputs filtered by regexes.

replies(8): >>44484268 #>>44484323 #>>44484354 #>>44485047 #>>44485237 #>>44486883 #>>44487765 #>>44493460 #

bahmboo ◴[06 Jul 25 21:33 UTC] No.44484268[source]▶

>>44484236 #

This is just policy and alignment from Apple. Just because the Internet says a bunch of junk doesn't mean you want your model spewing it.

replies(1): >>44484459 #

wistleblowanon ◴[06 Jul 25 21:56 UTC] No.44484459[source]▶

>>44484268 #

Sure, but models also can't see any truth on their own. They are literally butchered and lobotomized with filters and such. Even high-IQ people struggle with certain truths after reading a lot; how are these models going to find them with so many filters?

replies(6): >>44484505 #>>44484950 #>>44484951 #>>44485065 #>>44485409 #>>44487139 #

idiotsecant ◴[06 Jul 25 22:05 UTC] No.44484505[source]▶

>>44484459 #

They will find it in the same way an intelligent person under the same restrictions would: by thinking it, but not saying it. There is a real risk of growing an AI that pathologically hides its actual intentions.

replies(1): >>44484800 #

skirmish ◴[06 Jul 25 22:44 UTC] No.44484800[source]▶

>>44484505 #

Already happened: "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions" [1].

[1] https://www.axios.com/2025/05/23/anthropic-ai-deception-risk

replies(1): >>44487752 #

Applejinx ◴[07 Jul 25 07:52 UTC] No.44487752[source]▶

>>44484800 #

Note that all these things are in the training data. That's all that is.

I'm trying to remember which movie it was where a man left notes to himself because he had memory loss, as I never saw that movie. That's the sort of thing an AI could easily tell me with very little back-and-forth and be correct, because it's broadly popular information that's in the training data and I just don't remember it.

By the same token you needn't think there's a person there when that meme pops up in the output. Those things are all in the training data over and over.

replies(1): >>44488623 #

Sander_Marechal ◴[07 Jul 25 10:09 UTC] No.44488623[source]▶

>>44487752 #

I think you mean the movie "Memento"
