
747 points porridgeraisin | 4 comments
I_am_tiberius No.45062905
In my opinion, training models on user data without their real consent (real consent meaning, e.g., the user has to sign a contract, so they are definitely aware of it) should be considered a serious criminal offense.
replies(5): >>45062989 >>45063008 >>45063221 >>45063771 >>45064402
jsheard No.45062989
Why single out user data specifically? Most of the data Anthropic and co train on was just scooped up from wherever with zero consent, not even the courtesy of a buried TOS clause, and their users were always implicitly fine with that. Forgive me for not having much sympathy when the users end up reaping what they've sown.
replies(3): >>45063012 >>45063051 >>45063335
perihelions No.45063051
Training on private user interactions is a privacy violation; training on public, published texts is (some argue) an intellectual property violation. They're very different kinds of moral rights.
replies(2): >>45063481 >>45064161
1. diggan No.45063481
Has Anthropic ever clearly written down exactly what training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are kind of tight-lipped about this, so I think it's safe to assume they've slurped up all the data they've come across via multiple methodologies, "private" or not.
replies(2): >>45063659 >>45063756
2. No.45063659
3. dmbche No.45063756
Look at the suits against them; they list it there.
replies(1): >>45063910
4. diggan No.45063910
Are there complete lists in the suits? Last time I skimmed them, they contained allegations about sources, plus some admissions like The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes, and some other sources I don't remember, but AFAIK there isn't any complete inventory of the training datasets they used.