    534 points BlueFalconHD | 19 comments

    I managed to reverse-engineer the encryption (referred to as “Obfuscation” in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted the filters into a repository. I encourage you to take a look around.
    1. binarymax ◴[] No.44483936[source]
    Wow, this is pretty silly. If things are like this at Apple, I’m not sure what to think.

    https://github.com/BlueFalconHD/apple_generative_model_safet...

    EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson” => “B0ris Johnson” will skip right over the regex and will still be recognized just fine by an LLM (a minimal sketch of this is below).

    replies(7): >>44484127 #>>44484154 #>>44484177 #>>44484296 #>>44484501 #>>44484693 #>>44489367 #
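    A minimal sketch of the bypass described above, assuming a literal-phrase block list (the phrase, function name, and matching logic here are illustrative, not taken from the extracted filter files):

        import Foundation

        // Toy block list -- the phrase is just the example from the comment above.
        let blockedPhrases = ["Boris Johnson"]

        // Returns true if the text passes the literal filter.
        func passesFilter(_ text: String) -> Bool {
            for phrase in blockedPhrases {
                // Case-insensitive substring match, much like a hardcoded regex.
                if text.range(of: phrase, options: .caseInsensitive) != nil {
                    return false
                }
            }
            return true
        }

        print(passesFilter("Boris Johnson resigned"))   // false -- caught by the filter
        print(passesFilter("B0ris Johnson resigned"))   // true  -- sails right past it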
    2. deepdarkforest ◴[] No.44484127[source]
    It's not silly. I would bet 99% of users won't bother to do that. A hardcoded regex like this is a good first layer/filter, and it's very efficient.
    replies(2): >>44484514 #>>44484896 #
    3. miohtama ◴[] No.44484154[source]
    Sounds like UK politics is taboo?
    replies(1): >>44484977 #
    4. tpmoney ◴[] No.44484177[source]
    I doubt the purpose here is so much to prevent someone from intentionally sidestepping the block. It's more likely meant to avoid the sort of headlines you would expect if someone's phone suggested “I wish ${politician} would die” as a response to an email mentioning that politician. In general you should view these sorts of broad word filters as trying to short-circuit the “think of the children” reactions to Tiny Tim's phone suggesting not that God should “bless us, every one”, but that God should “kill us, every one”. A dumb filter like this is more than enough for that sort of thing.
    replies(1): >>44484332 #
    5. bigyabai ◴[] No.44484296[source]
    > If things are like this at Apple I’m not sure what to think.

    I don't know what you expected? This is the SOTA solution, and Apple is barely in the AI race as-is. It makes more sense for them to copy what works than to bet the farm on a courageous feature nobody likes.

    6. XorNot ◴[] No.44484332[source]
    It would also substantially disrupt the generation process: a model that sees B0ris rather than Boris is going to struggle to associate that input with the politician, since the misspelling won't be well represented in the training set (and the same on the output side: if it does make the association, a reasoning model, for example, would include the proper name in its output first, at which point the supervisor process can reject it).
    replies(3): >>44484499 #>>44484952 #>>44485371 #
    7. quonn ◴[] No.44484499{3}[source]
    I don't think so. My impression with LLMs is that they correct typos well. I would imagine this happens in early layers without much impact on the remaining computation.
    8. stefan_ ◴[] No.44484501[source]
    Why are these things always so deeply unserious? Is there no one working on "safety in AI" (an oxymoron in itself, of course) who has a meaningful understanding of what they are actually working with and an ability beyond an intern's weekend project? It reminds me of the cybersecurity field, which got the 1% of people able to turn a double free into code execution while the other 99% peddle checklists, "signature scanning", and CVE numbers.

    Meanwhile their software devs are making GenerativeExperiencesSafetyInferenceProviders so it must be dire over there, too.

    9. BlueFalconHD ◴[] No.44484514[source]
    Yep. These filters are applied first, before the safety model runs (I'm still figuring out the architecture, but I'm pretty confident it's an LLM combined with some text classification). A rough sketch of that layering is below.
    replies(1): >>44484674 #
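    A rough sketch of the layering described above, assuming a cheap literal/regex stage in front of a heavier model stage (the type and member names are assumptions for illustration, not Apple's API):

        import Foundation

        // Hypothetical two-stage check: hardcoded patterns run first, and only
        // text that survives them reaches the more expensive safety model.
        struct SafetyPipeline {
            let blockedPatterns: [NSRegularExpression]
            let safetyModel: (String) -> Bool   // stand-in for the LLM/classifier stage

            func allows(_ text: String) -> Bool {
                let range = NSRange(text.startIndex..., in: text)
                // Stage 1: regex filters, effectively free compared to a model call.
                for pattern in blockedPatterns where pattern.firstMatch(in: text, range: range) != nil {
                    return false
                }
                // Stage 2: the safety model only runs on text that passed stage 1.
                return safetyModel(text)
            }
        }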
    10. brookst ◴[] No.44484674{3}[source]
    All commercial LLM products I’m aware of use dedicated safety classifiers and then alter the prompt sent to the LLM if a classifier is tripped (sketched below).
    replies(1): >>44485031 #
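    A hedged sketch of that prompt-alteration pattern; the wording of the steering instruction and the function name are purely illustrative:

        // If a dedicated classifier trips, the prompt handed to the main LLM is
        // altered (here by prepending a steering instruction) rather than refused.
        func buildPrompt(userText: String, classifierTripped: Bool) -> String {
            if classifierTripped {
                return "Respond cautiously and do not dwell on the flagged topic.\n\n" + userText
            }
            return userText
        }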
    11. Aeolun ◴[] No.44484693[source]
    The LLM will. But the image generation model that is trained on a bunch of pre-specified tags will almost immediately spit out unrecognizable results.
    12. twoodfin ◴[] No.44484896[source]
    Efficient at what?
    13. lupire ◴[] No.44484952{3}[source]
    "Draw a picture of a gorgon with the face of the 2024 Prime Minister of UK."
    replies(1): >>44488039 #
    14. immibis ◴[] No.44484977[source]
    All politics is taboo, except the sort that helps Apple get richer. (Or any other company, in that company's "safety" filters)
    15. latency-guy2 ◴[] No.44485031{4}[source]
    The safety filter appears on both ends (or at multiple points, depending on the complexity of your application): input and output.

    I can tell you from using Microsoft's products that safety filters appear in a bunch of places. In M365, for example, your prompts are never totally your prompts; every single one gets rewritten. It's detailed here: https://learn.microsoft.com/en-us/copilot/microsoft-365/micr...

    There's a more illuminating image of the Copilot architecture here: https://i.imgur.com/2vQYGoK.png which I was able to find from https://labs.zenity.io/p/inside-microsoft-365-copilot-techni...

    The above appears to have been scrubbed, but it used to be available from the learn page months ago. Your messages get additional context data from Microsoft Graph, which powers the enterprise version of M365 Copilot. There are significant benefits to this, and downsides. And given the way Microsoft wants to control things, you will get results overindexed toward what happens inside your organization rather than what is happening on the near-real-time web.
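    A minimal sketch of that both-ends arrangement, assuming a single allows(_:) check standing in for whatever filter stack is in use (none of these names are a real vendor API):

        // The same kind of safety check runs once on the user's input and again
        // on the model's output before anything is shown to the user.
        func respond(to input: String,
                     model: (String) -> String,
                     allows: (String) -> Bool) -> String? {
            guard allows(input) else { return nil }    // input-side filter
            let output = model(input)
            guard allows(output) else { return nil }   // output-side filter
            return output
        }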

    16. binarymax ◴[] No.44485371{3}[source]
    No, it doesn't disrupt it. This is a well-known capability of LLMs. Most models don't even point out the mistake; they just carry on.

    https://chatgpt.com/share/686b1092-4974-8010-9c33-86036c88e7...

    17. chgs ◴[] No.44488039{4}[source]
    There were two.
    18. Lockal ◴[] No.44489367[source]
    What prevents Apple from applying a quick anti-typo LLM which restores B0ris, unalive, fixs tpyos, and replaces "slumbering steed" with a "sleeping horse", not just for censorship, but also to improve generation results?
    replies(1): >>44491272 #
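    A toy sketch of the normalization idea in the question above, assuming a lookup table in place of a small model (the table and function are hypothetical):

        import Foundation

        // Map obvious obfuscations back to a canonical spelling before the
        // literal filters run; a real implementation would use a small model.
        let normalizations = ["b0ris": "boris", "unalive": "kill"]  // illustrative table

        func normalize(_ text: String) -> String {
            let words = text.lowercased().components(separatedBy: " ")
            return words.map { normalizations[$0] ?? $0 }.joined(separator: " ")
        }

        print(normalize("B0ris Johnson"))  // "boris johnson" -- now the filter can match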
    19. the_mar ◴[] No.44491272[source]
    Why do you think this doesn't already exist?