(github.com)

534 points BlueFalconHD | 1 comments | 06 Jul 25 19:50 UTC

I managed to reverse engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.

trebligdivad ◴[06 Jul 25 20:56 UTC] No.44483981[source]▶

>>44483485 (OP) #

Some of the combinations are a bit weird. This one has lots of stuff avoiding death... together with a set ensuring all the Apple brands have the correct capitalisation. Priorities, hey!

https://github.com/BlueFalconHD/apple_generative_model_safet...

grues-dinner ◴[06 Jul 25 21:09 UTC] No.44484073[source]▶

>>44483981 #

Interesting that it didn't seem to include "unalive".

Which, as a phenomenon, is very telling: no one actually cares what people are really saying. Everyone, including the platforms, knows what that means. It's all performative.

qingcharles ◴[06 Jul 25 21:22 UTC] No.44484164[source]▶

>>44484073 #

It's totally performative. There's no way to stay ahead of the new language that people create.

At what point do the new words become the actual words? Are there many instances of people using unalive IRL?

Terr_ ◴[06 Jul 25 23:08 UTC] No.44484970[source]▶

>>44484164 #

> There's no way to stay ahead of the new language that people create.

I'm imagining a new exploit: After someone says something totally innocent, people gang up in the comments to act like a terrible vicious slur has been said, and then the moderation system (with an LLM involved somewhere) "learns" that an arbitrary term is heinous and indirectly bans any discussion of that topic.

SXX ◴[07 Jul 25 04:42 UTC] No.44486827[source]▶

>>44484970 #

It's not like this is unique to LLMs either. With a little trolling on the internet you can easily turn the "OK" hand gesture into a hate symbol of white supremacy. And fools will fall for it.

overfeed ◴[07 Jul 25 05:41 UTC] No.44487124{3}[source]▶

>>44486827 #

...and then the bigots will fall for it too, and start using it in earnest, completing the cycle.

coldtea ◴[07 Jul 25 08:37 UTC] No.44488051{4}[source]▶

>>44487124 #

Who cares what the bigots use?

If the bigots start using "thank you" as some code word, should we stop saying it, lest we pollute our non-bigoted discussions?

Bigots drink coffee too; maybe we should stop drinking it, because something-something...

1. Eisenstein ◴[07 Jul 25 09:19 UTC] No.44488318{5}[source]▶

>>44488051 #

It's all context-dependent. There can be words or symbols which are totally benign but, when used in a different context, do carry a loaded meaning. Case in point: cheese pizza.

I extracted the safety filters from Apple Intelligence models