I extracted the safety filters from Apple Intelligence models

(github.com)

534 points BlueFalconHD | 2 comments | 06 Jul 25 19:50 UTC | HN request time: 0.465s | source

I managed to reverse engineer the encryption (refered to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.

Show context

trebligdivad ◴[06 Jul 25 20:56 UTC] No.44483981[source]▶

>>44483485 (OP) #

Some of the combinations are a bit weird, This one has lots of stuff avoiding death....together with a set ensuring all the Apple brands have the correct capitalisation. Priorities hey!

https://github.com/BlueFalconHD/apple_generative_model_safet...

replies(11): >>44483999 #>>44484073 #>>44484095 #>>44484410 #>>44484636 #>>44486072 #>>44487916 #>>44488185 #>>44488279 #>>44488362 #>>44488856 #

grues-dinner ◴[06 Jul 25 21:09 UTC] No.44484073[source]▶

>>44483981 #

Interesting that it didn't seem to include "unalive".

Which as a phenomenon is so very telling that no one actually cares what people are really saying. Everyone, including the platforms knows what that means. It's all performative.

replies(11): >>44484164 #>>44484360 #>>44484635 #>>44484665 #>>44485033 #>>44485034 #>>44486246 #>>44487244 #>>44488055 #>>44488114 #>>44500918 #

qingcharles ◴[06 Jul 25 21:22 UTC] No.44484164[source]▶

>>44484073 #

It's totally performative. There's no way to stay ahead of the new language that people create.

At what point do the new words become the actual words? Are there many instances of people using unalive IRL?

replies(17): >>44484171 #>>44484218 #>>44484614 #>>44484958 #>>44484970 #>>44484989 #>>44485202 #>>44485277 #>>44485309 #>>44486128 #>>44486394 #>>44487625 #>>44487839 #>>44487936 #>>44488097 #>>44488704 #>>44493436 #

Terr_ ◴[06 Jul 25 23:08 UTC] No.44484970[source]▶

>>44484164 #

> There's no way to stay ahead of the new language that people create.

I'm imagining a new exploit: After someone says something totally innocent, people gang up in the comments to act like a terrible vicious slur has been said, and then the moderation system (with an LLM involved somewhere) "learns" that an arbitrary term is heinous eand indirectly bans any discussion of that topic.

replies(5): >>44485038 #>>44485110 #>>44485356 #>>44486827 #>>44486843 #

grues-dinner ◴[07 Jul 25 04:47 UTC] No.44486843[source]▶

>>44484970 #

The first half of that already happened with the OK gesture: https://www.bbc.co.uk/news/newsbeat-49837898.

Though it would be fun to see what happens if an LLM if used to ban anything that tends to generate heated exchanges. It would presumably learn to ban racial terms, politics and politicians and words like "immigrant" (i.e. basically the list in this repo), but what else could it be persuaded to ban? Vim and Emacs? SystemD? Anything involving cyclists? Parenting advice?

replies(3): >>44487593 #>>44488628 #>>44488746 #

immibis ◴[07 Jul 25 07:25 UTC] No.44487593[source]▶

>>44486843 #

People weren't using the OK gesture innocently. After 4chan trolls decided to start pretending it was a white supremacist symbol, actual white supremacists started using it as a symbol.

replies(2): >>44488027 #>>44488626 #

coldtea ◴[07 Jul 25 08:34 UTC] No.44488027[source]▶

>>44487593 #

All 10 of them?

What about the other 7-8 billion people still using it normally?

replies(2): >>44488487 #>>44498920 #

thephyber ◴[07 Jul 25 09:45 UTC] No.44488487[source]▶

>>44488027 #

Some were using it in the traditional unironic (and IMHO cringe) way, similar to anyone who used the phrase “Let’s go, Brandon!” Before that NASCAR race when MAGAs adopted it as ironic + coded vice signaling.

Quit being overly pedantic. We all knew there was an unironic purpose for the gesture before it became ironic.

replies(1): >>44495169 #

1. coldtea ◴[07 Jul 25 22:14 UTC] No.44495169[source]▶

>>44488487 #

I mean, advice from a person who considers the traditional unironic use of OK as "cringe"...

Whatever dude

replies(1): >>44529212 #

2. thephyber ◴[11 Jul 25 07:14 UTC] No.44529212[source]▶

>>44495169 (TP) #

I haven’t used the “ok” gesture unironically since I watched Beverly Hills Cop (1984).

↑