
747 points porridgeraisin | 4 comments
I_am_tiberius No.45062905
In my opinion, training models on user data without their real consent (real consent meaning, e.g., the user has to sign a contract, so they are definitely aware of it) should be considered a serious criminal offense.
replies(5): >>45062989 >>45063008 >>45063221 >>45063771 >>45064402
jsheard No.45062989
Why single out user data specifically? Most of the data Anthropic and co train on was just scooped up from wherever with zero consent, not even the courtesy of a buried TOS clause, and their users were always implicitly fine with that. Forgive me for not having much sympathy when the users end up reaping what they've sown.
replies(3): >>45063012 >>45063051 >>45063335
perihelions No.45063051
Training on private user interactions is a privacy violation; training on public, published texts is (some argue) an intellectual property violation. They're very different kinds of moral rights.
replies(2): >>45063481 >>45064161
1. diggan No.45063481
Has Anthropic ever clearly written down exactly what training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are kind of tight-lipped about this, so I think it's safe to assume they've slurped up all the data they've come across via multiple methodologies, "private" or not.
replies(2): >>45063659 >>45063756
2. No.45063659
3. dmbche No.45063756
Look at the suits against them; they list it there.
replies(1): >>45063910
4. diggan No.45063910
Are there complete lists in the suits? Last time I skimmed them, they contained allegations about sources, plus some admissions like The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes, and some other sources I don't remember, but AFAIK there isn't any complete inventory of the training datasets they used.