439 points diggan | 21 comments
AlecSchueler No.45062904
Am I the only one who assumed everything was already being used for training?
replies(9): >>45062929 #>>45063168 #>>45063951 #>>45064966 #>>45065323 #>>45065428 #>>45065912 #>>45066950 #>>45070135 #
1. Aurornis No.45065912
I don't understand this mindset. Why would you assume anything? It took me a couple minutes at most to check when I first started using Claude.

I check when I start using any new service. The cynical assumption that everything's being shared leads to shrugging it off and making no attempt to look for settings.

It only takes a moment to go into settings -> privacy and look.

replies(7): >>45065932 #>>45065968 #>>45066053 #>>45066125 #>>45068206 #>>45068998 #>>45070223 #
2. hshdhdhj4444 No.45065932
Huh, they’re not assuming anything is “being shared”.

They’re assuming that Anthropic, which is already receiving and storing your data, is also training its models on that data.

How are you supposed to disprove that as a user?

Also, the whole point is that companies cannot be trusted to follow the settings.

replies(1): >>45067803 #
3. Capricorn2481 No.45065968
> It only takes a moment to go into settings -> privacy and look.

Do you have any reason to think this does anything?

replies(1): >>45066045 #
4. serial_dev No.45066045
Jira ticket No. 97437838: Training service ignores settings, trains on your data anyway. Priority: extremely low. We'll probably get to it in 2031 when the intern joins.
replies(1): >>45066557 #
5. efficax No.45066053
A silicon valley startup would never say one thing and then do another!
6. lbrito No.45066125
>Why would you assume anything?

Because they already used data without permission on a much larger scale, so it's a perfectly logical assumption that they would continue doing so with their users' data.

replies(1): >>45067797 #
7. nbulka No.45066557{3}
This! All the times HIPAA and data privacy laws get ignored directly in Jira tickets, too. SMH.
8. simonw No.45067797
I don't think that logically makes sense.

Training on everything you can publicly scrape from the internet is a very different thing from training on data that your users submit directly to your service.

replies(2): >>45069962 #>>45070009 #
9. simonw No.45067803
Why can't companies be trusted to follow the settings?

If they add those settings, why would you expect them not to respect them? Do you think they're purely cosmetic features that don't actually do anything?

replies(3): >>45070033 #>>45070257 #>>45080866 #
10. UltraSane No.45068206
Because the demand for training data is insatiable: they are already using basically everything available, they need more human-generated data, and chats with their own LLM are a perfect source.
11. themafia No.45068998
> I check when I start using any new service.

So your assumption is that the reported privacy policy of any company is completely accurate, that there is no means for the company to violate this policy, and that once it's violated you will immediately be notified.

> It only takes a moment to go into settings -> privacy and look.

It only takes a moment to examine history and observe why this is wholly inadequate.

12. rpgbr No.45069962{3}
>Training on everything you can publicly scrape from the internet is a very different thing from training on data that your users submit directly to your service.

Yes. It's way easier and cheaper when the data comes to you instead of having to scrape everything elsewhere.

13. fcarraldo No.45070009{3}
OpenAI, Meta, and X all train on user-submitted data; in Meta’s and X’s case, data that had been submitted long before the advent of LLMs.

It’s not a leap to assume Anthropic does the same.

replies(1): >>45072303 #
14. fcarraldo No.45070033{3}
Because they can’t be?

https://www.reuters.com/sustainability/boards-policy-regulat...

https://www.bbc.com/news/articles/cx2jmledvr3o

replies(1): >>45071304 #
15. sjapkee No.45070223
Bro really thinks privacy settings work
16. fcarraldo No.45070257{3}
Also currently being discussed[0], on this very site, are both speculation that Meta is surreptitiously scanning your camera roll and a comment from someone claiming they worked on an earlier implementation to do just that.

It’s shocking to me that anyone who works in our industry would trust any company to do as they claim.

[0] https://news.ycombinator.com/item?id=45062910

17. simonw No.45071304{4}
There is an enormous gap between the behavior covered in those two cases and training machine learning models on user data that a company has specifically said it will not use for training.
18. adastra22 No.45072303{4}
By X do you mean tweets? Can you not see how different that is from training on your private conversations with an LLM?

What if you ask it for medical advice, or about legal matters? What if you turn on Gmail integration? Should I now be able to generate your conversations with the right prompt?

replies(1): >>45085938 #
19. AlecSchueler No.45080866{3}
Have you really never heard of companies saying one thing while doing another?
replies(1): >>45081600 #
20. simonw No.45081600{4}
Yes, normally when they lose a lawsuit over it.
21. fcarraldo No.45085938{5}
I don't think AI companies should be doing this, but they are doing it, and all of them are opt-out, not opt-in. Anthropic is just changing its policies to match its competition.

xAI trains Grok on both public data (Tweets) and non-public data (Conversations with Grok) by default. [0]

> Grok.com Data Controls for Training Grok: For the Grok.com website, you can go to Settings, Data, and then “Improve the Model” to select whether your content is used for model training.

Meta trains its AI on things posted to Meta's products, which are not as "public" as Tweets on X, because users expect these to be shared only with their networks. They do not use DMs, but they do use posts to Instagram/Facebook/etc. [1]

> We use information that is publicly available online and licensed information. We also use information shared on Meta Products. This information could be things like posts or photos and their captions. We do not use the content of your private messages with friends and family to train our AIs unless you or someone in the chat chooses to share those messages with our AIs.

OpenAI uses conversations for training data by default. [2]

> When you use our services for individuals such as ChatGPT, Codex, and Sora, we may use your content to train our models.

> You can opt out of training through our privacy portal by clicking on “do not train on my content.” To turn off training for your ChatGPT conversations and Codex tasks, follow the instructions in our Data Controls FAQ. Once you opt out, new conversations will not be used to train our models.

[0] https://x.ai/legal/faq

[1] https://www.facebook.com/privacy/genai/

[2] https://help.openai.com/en/articles/5722486-how-your-data-is...