747 points porridgeraisin | 13 comments
    I_am_tiberius ◴[] No.45062905[source]
In my opinion, training models on user data without their real consent (real consent meaning, e.g., the user must sign a contract, so they're definitely aware) should be considered a serious criminal offense.
    replies(5): >>45062989 #>>45063008 #>>45063221 #>>45063771 #>>45064402 #
    1. jsheard ◴[] No.45062989[source]
    Why single out user data specifically? Most of the data Anthropic and co train on was just scooped up from wherever with zero consent, not even the courtesy of a buried TOS clause, and their users were always implicitly fine with that. Forgive me for not having much sympathy when the users end up reaping what they've sown.
    replies(3): >>45063012 #>>45063051 #>>45063335 #
    2. I_am_tiberius ◴[] No.45063012[source]
100% true.
    3. perihelions ◴[] No.45063051[source]
    Training on private user interactions is a privacy violation; training on public, published texts is (some argue) an intellectual property violation. They're very different kinds of moral rights.
    replies(2): >>45063481 #>>45064161 #
    4. __MatrixMan__ ◴[] No.45063335[source]
    Publishing something is considered by most to be sufficient consent for it to be not considered private.

    I realize there's a whole legal quagmire here involved with intellectual "property" and what counts as "derivative work", but that's a whole separate (and dubiously useful) part of the law.

    replies(1): >>45063793 #
    5. diggan ◴[] No.45063481[source]
Has Anthropic ever written clearly about exactly which training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are pretty tight-lipped about this, so I think it's safe to assume they've slurped up all the data they've come across via multiple methodologies, "private" or not.
    replies(2): >>45063659 #>>45063756 #
    6. ◴[] No.45063659{3}[source]
    7. dmbche ◴[] No.45063756{3}[source]
Look at the suits against them; they list it there.
    replies(1): >>45063910 #
    8. chamomeal ◴[] No.45063793[source]
That is definitely normally true, but I feel like the scale of LLM usage turns it into a different problem.

    If you can use all of the content of stack overflow to create a “derivative work” that replaces stack overflow, and causes it to lose tons of revenue, is it really a derivative work?

    I’m pretty sure solution sites like chegg don’t include the actual questions for that reason. The solutions to the questions are derivative, but the questions aren’t.

    replies(2): >>45063899 #>>45064495 #
    9. airstrike ◴[] No.45063899{3}[source]
    Replacing stack overflow has no bearing on the definition of "derivative"
    10. diggan ◴[] No.45063910{4}[source]
    Are there complete lists in the suits? Last time I skimmed them, they contained allegations of sources, and some admissions like The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes and some other sources I don't remember, but AFAIK there isn't any complete inventory of training datasets they used.
    11. jsheard ◴[] No.45064161[source]
    I wish I could be so optimistic that there is no private information published unintentionally or maliciously on the open web where crawlers can find it.

    (and as diggan said, the web isn't the only source they use anyway. who knows what they're buying from data brokers.)

    12. __MatrixMan__ ◴[] No.45064495{3}[source]
    Stack overflow doesn't really have a legitimate claim to that data either though. Nor do the users, we're just pasting error messages and documentation. It's derivative all the way down. It'll never sit still and behave like property.

    Privacy makes sense, treating data like property does not.

    replies(1): >>45065641 #
    13. chamomeal ◴[] No.45065641{4}[source]
    Point taken, but it still feels like a gray area to me. The value that SO created was the curation of knowledge and high quality discussions that were well indexed and searchable.

    The users did provide the data, which is a good point. But there’s a reason SO was so useful to developers and quora was not. It also made it a perfect feeding ground for hungry LLMs.

    Then again I’m just guessing that big models are trained on SO. Maybe that’s not true