
52 points layer8 | 2 comments
jsenn No.43687027
> This approach works by randomly polling participating devices for whether they’ve seen a particular fragment, and devices respond anonymously with a noisy signal. By noisy, we mean that devices may provide the true signal of whether a fragment was seen or a randomly selected signal for an alternative fragment or no matches at all. By calibrating how often devices send randomly selected responses, we ensure that hundreds of people using the same term are needed before the word can be discoverable. As a result, Apple only sees commonly used prompts, cannot see the signal associated with any particular device, and does not recover any unique prompts. Furthermore, the signal Apple receives from the device is not associated with an IP address or any ID that could be linked to an Apple Account. This prevents Apple from being able to associate the signal to any particular device.

The way I read this, there's no discovery mechanism here, so Apple has to guess a priori which prompts will be popular. How do they know what queries to send?
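For what it's worth, the "noisy signal" part sounds like classic randomized response. Here's a toy sketch of how I read it (all names and the noise rate are my assumptions, not Apple's actual parameters):

```python
import random

# Hypothetical fragment vocabulary the server polls for.
FRAGMENTS = ["tennis lesson", "dinner reservation", "flight delay"]

def noisy_response(seen_fragment, queried_fragment, p_truth=0.25, rng=random):
    """With probability p_truth, answer honestly whether the queried
    fragment was seen; otherwise return a uniformly random answer
    (a random fragment or 'no match'). The server only learns anything
    once enough honest 'yes' answers rise above the noise floor."""
    if rng.random() < p_truth:
        return queried_fragment if seen_fragment == queried_fragment else "no match"
    return rng.choice(FRAGMENTS + ["no match"])
```

With a low p_truth, any single response is deniable, and the server has to aggregate hundreds of responses before a fragment's true count separates from the random baseline -- which matches the "hundreds of people using the same term" claim. But yes, this only works for fragments Apple already thought to ask about.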

replies(3): >>43687064 #>>43689472 #>>43701101 #
1. vineyardmike No.43689472
I think they do guess a priori what to query...

Later in the article, for a different (but similar) feature:

> To curate a representative set of synthetic emails, we start by creating a large set of synthetic messages on a variety of topics... We then derive a representation, called an embedding, of each synthetic message that captures some of the key dimensions of the message like language, topic, and length. These embeddings are then sent to a small number of user devices that have opted in to Device Analytics.

It's crazy to think Apple is constantly asking my iPhone if I ever write emails similar to emails about tennis lessons (their example). This feels like the least efficient way to understand users in this context. Especially considering they host an email server!
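As I understand the embedding part (this is my reading, not Apple's published code), the device just compares the server's synthetic-message embeddings against embeddings of its own messages and reports which synthetic one is closest -- something like:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def closest_synthetic(local_embedding, synthetic_embeddings):
    """On-device step: index of the synthetic message whose embedding is
    most similar to one of the user's local messages. In the real system
    this signal would be noised before leaving the device."""
    return max(range(len(synthetic_embeddings)),
               key=lambda i: cosine(local_embedding, synthetic_embeddings[i]))
```

So the device never uploads your email, only a (noisy) vote for whichever synthetic stand-in resembles it most.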

replies(1): >>43691627 #
2. jsenn No.43691627
yeah, the linked paper [1] has more detail--basically they seem to start with a seed set of "class labels" and subcategories (e.g. "restaurant review" + "steak house"). They ask an LLM to generate lots of random texts incorporating those labels. They build a differentially private histogram of similarities between the embeddings of those texts and the private data, then use that histogram to resample the texts, which become the seeds for the next iteration--sort of like a particle filter.
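The resampling step of that loop might look roughly like this (toy sketch; I'm assuming the noisy histogram counts are already computed on-device, which is where the DP guarantee actually lives):

```python
import random

def resample_seeds(texts, dp_histogram, rng=random.Random(0)):
    """Resample candidate synthetic texts in proportion to their noisy
    popularity counts, producing the seed set for the next LLM
    generation round -- the particle-filter-like step."""
    total = sum(dp_histogram)
    weights = [count / total for count in dp_histogram]
    return rng.choices(texts, weights=weights, k=len(texts))
```

Texts whose embeddings matched nothing private get weight near zero and die out, so over iterations the synthetic population drifts toward the real distribution without the server ever seeing a real message.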

I'm still unclear on how you create that initial set of class labels used to generate the random seed texts, and how sensitive the method is to that initial corpus.

[1] https://arxiv.org/abs/2403.01749