
604 points wyldfire | 1 comment | HN request time: 0.352s | source
1. supermatt ◴[] No.26353533[source]
I am still not sure what Google is doing to generate these cohorts, or how it labels a browser accordingly.

This article talks about simhash. My understanding is that simhash does NOT perform any such analysis, but instead generates a fingerprint from the content that is comparable to the fingerprints of similar content. For example, a checksum of my HN homepage is different to yours, as mine contains user-specific information; simhash, however, gives us comparable "fingerprints".
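To illustrate what I mean, here is a toy sketch of the simhash idea (NOT Chrome's actual implementation - the per-token MD5 hashing, the 64-bit width, and the example pages are all my own assumptions):

```python
import hashlib

def simhash(text, bits=64):
    """Toy simhash: each token votes +1/-1 on each bit position; the
    fingerprint keeps the bits with a positive total. Pages sharing most
    tokens end up with fingerprints that differ in only a few bits."""
    votes = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Invented example pages: b is a near-duplicate of a (same page plus a
# user-specific greeting); c is unrelated content.
page_a = "front page news for hackers startup tech stories"
page_b = "front page news for hackers startup tech stories welcome alice"
page_c = "completely unrelated recipe involving chocolate cake frosting"

# Near-duplicates land a few bits apart; unrelated pages land far apart.
print(hamming(simhash(page_a), simhash(page_b)))
print(hamming(simhash(page_a), simhash(page_c)))
```

Unlike a checksum, where one changed byte scrambles the whole digest, the shared tokens dominate the vote, so the user-specific bits only flip a handful of positions.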

In short, this only works for identifying pages with similar content - by itself I can't see how it could be used for cohort analysis unless there is either a client-side ML model involved, or millions of simhashes are shipped with the browser.

I raised this question the other day, and a Googler pointed me to the "code". That code contained no reference to an ML model, its construction, or its datasets. The code also contained no pool of simhashes. To me, that means there is no way to label the browser with cohorts locally. Furthermore, the code appears to generate these simhashes and then sync them to Google via (the account-identifiable) Chrome sync. It is there that the analysis is performed. Maybe this is why it's called "Federated LEARNING" instead of "Federated INFERENCE"?

Is this truly what we can expect? That instead of Google tracking our behaviour via ONLY their analytics and advertising partners, they will now be silently collecting a hash of EVERY page we visit and sending it to Google directly? How is this private? The whole point of a simhash is to establish similarity between pages - and Google already holds an enormous lookup table of simhashes from their search crawl.
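To make the concern concrete: matching a synced fingerprint back to a known page would be trivial server-side. A hypothetical sketch (the index contents, URLs, and threshold are all invented for illustration):

```python
def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Hypothetical server-side index: simhashes a crawler could already hold.
index = {
    "news.ycombinator.com": 0b1011011001101001,
    "example.com/recipes":  0b0100100110010110,
}

def closest_page(synced_fp, index, max_distance=4):
    """Find the indexed page nearest (in Hamming distance) to a synced
    fingerprint; return None if nothing is plausibly the same page."""
    url = min(index, key=lambda u: hamming(synced_fp, index[u]))
    return url if hamming(synced_fp, index[url]) <= max_distance else None

# A user-specific copy of a page hashes to something only a few bits away
# from the crawl's copy, so it still matches.
print(closest_page(0b1011011001101101, index))  # prints news.ycombinator.com
```

That is exactly the property simhash is designed for - so syncing the fingerprints rather than the pages buys you very little privacy against anyone who already has the index.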

I got downvoted for raising this concern before. I would appreciate it if someone would tell me what I have wrong instead of piling on the downvotes. No one seems to be talking about this.