
1311 points msoad | 29 comments
1. sillysaurusx ◴[] No.35393782[source]
On the legal front, I’ve been working with counsel to draft a counterclaim to Meta’s DMCA against llama-dl. (GPT-4 is surprisingly capable, but I’m talking to a few attorneys: https://twitter.com/theshawwn/status/1641841064800600070?s=6...)

An anonymous HN user named L pledged $200k for llama-dl’s legal defense: https://twitter.com/theshawwn/status/1641804013791215619?s=6...

This may not seem like much vs Meta, but it’s enough to get the issue into the court system where it can be settled. The tweet chain has the details.

The takeaway for you is that you’ll soon be able to use LLaMA without worrying that Facebook will knock you offline for it. (I wouldn’t push your luck by trying to use it for commercial purposes though.)

Past discussion: https://news.ycombinator.com/item?id=35288415

I’d also like to take this opportunity to thank all of the researchers at MetaAI for their tremendous work. It’s because of them that we have access to such a wonderful model in the first place. They have no say over the legal side of things. One day we’ll all come together again, and this will just be a small speedbump in the rear view mirror.

EDIT: Please do me a favor and skip ahead to this comment: https://news.ycombinator.com/item?id=35393615

It's from jart, the author of the PR the submission points to. I really had no idea that this was a de facto Show HN, and it's terribly rude to post my comment in that context. I only meant to reassure everyone that they can freely hack on llama, not make a huge splash and detract from their moment on HN. (I feel awful about that; it's wonderful to be featured on HN, and no one should have to share their spotlight when it's a Show HN. Apologies.)

replies(7): >>35393813 #>>35393848 #>>35394028 #>>35394029 #>>35394084 #>>35394156 #>>35394431 #
2. terafo ◴[] No.35393813[source]
Wish you all the luck in the world. We need much more clarity on the legal status of these models.
replies(1): >>35393827 #
3. sillysaurusx ◴[] No.35393827[source]
Thanks! HN is pretty magical. I think they saw https://news.ycombinator.com/item?id=35288534 and decided to fund it.

I’m grateful for the opportunity to help protect open source projects such as this one. It will at least give Huggingface a basis to resist DMCAs in the short term.

replies(1): >>35393991 #
4. sheeshkebab ◴[] No.35393848[source]
All models trained on public data need to be made public. As it is, their outputs are not copyrightable, so it's not a stretch to say the models themselves are public domain.
replies(3): >>35393876 #>>35394018 #>>35407677 #
5. sillysaurusx ◴[] No.35393876[source]
I’m honestly not sure. RLHF seems particularly tricky: if someone is shaping a model by hand, it seems reasonable to extend copyright protection to them.

For the moment, I’m just happy to disarm corporations from using DMCAs against open source projects. The long term implications will be interesting.

6. ◴[] No.35393991{3}[source]
7. xoa ◴[] No.35394018[source]
You seem to be mixing a few different things together here. There's a huge leap from something not being copyrightable to saying there is grounds for it to be made public. No copyright would greatly limit the ability of model makers to legally restrict distribution if they made it to the public, but they'd be fully within their rights to keep them as trade secrets to the best of their ability. Trade secret law and practice is its own thing separate from copyright, lots of places have private data that isn't copyrightable (pure facts) but that's not the same as it being made public. Indeed part of the historic idea of certain areas of IP like patents was to encourage more stuff to be made public vs kept secret.

>As it is their outputs are not copyrightable, it’s not a stretch to say models are public domain.

With all respect this is kind of nonsensical. "Public domain" only applies to stuff that is copyrightable, if they simply aren't then it just never enters into the picture. And it not being patentable or copyrightable doesn't mean there is any requirement to share it. If it does get out though then that's mostly their own problem is all (though depending on jurisdiction and contract whoever did the leaking might get in trouble), and anyone else is free to figure it out on their own and share that and they can't do anything.

replies(1): >>35394844 #
8. cubefox ◴[] No.35394028[source]
Even if using LLaMA turns out to be legal, I very much doubt it is ethical. The model got leaked while it was only intended for research purposes. Meta engineered and paid for the training of this model. It's theirs.
replies(5): >>35394052 #>>35394067 #>>35394111 #>>35394143 #>>35394388 #
9. ◴[] No.35394029[source]
10. Uupis ◴[] No.35394052[source]
I feel like most-everything about these models gets really ethically-grey — at worst — very quickly.
11. willcipriano ◴[] No.35394067[source]
What did they train it on?
replies(1): >>35394204 #
12. ◴[] No.35394084[source]
13. faeriechangling ◴[] No.35394111[source]
Did Meta ask permission from every user whose data they trained their model on? Did all those users consent, and when I say consent I mean was there a meeting of minds, not something buried on page 89 of a EULA, to Meta building an AI with their data?

Turnabout is fair play. I don't feel the least bit sorry for Meta.

replies(3): >>35394149 #>>35394190 #>>35394302 #
14. dodslaser ◴[] No.35394143[source]
Meta as a company has shown pretty blatantly that they don't really care about ethics, nor the law for that matter.
15. terafo ◴[] No.35394149{3}[source]
LLaMa was trained on data of Meta users, though.
replies(1): >>35399778 #
16. electricmonk ◴[] No.35394156[source]
IANYL - This is not legal advice.

As you may be aware, a counter-notice that meets the statutory requirements will result in reinstatement unless Meta sues over it. So the question isn't so much whether your counter-notice covers all the potential defenses as whether Meta is willing to sue.

The primary hurdle you're going to face is your argument that weights are not creative works, and not copyrightable. That argument is unlikely to succeed for the following reasons (just off the top of my head): (i) The act of selecting training data is more akin to an encyclopedia than the white pages example you used on Twitter, and encyclopedias are copyrightable as to the arrangement and specific descriptions of facts, even though the underlying facts are not; and (ii) LLaMA, GPT-N, Bard, etc., all have different weights, different numbers of parameters, different amounts of training data, and different tuning, which puts paid to the idea that there is only one way to express the underlying ideas, or that all of it is necessarily controlled by the specific math involved.

In addition, Meta has the financial wherewithal to crush you even were you legally on sound footing.

The upshot of all of this is that you may win for now if Meta doesn't want to file a rush lawsuit, but in the long run, you likely lose.

replies(1): >>35395300 #
17. cubefox ◴[] No.35394190{3}[source]
But it doesn't copy any text one to one. The largest one was trained on 1.4 trillion tokens, if I recall correctly, but the model size is just 65 billion parameters. (I believe they use 16 bits per token and parameter.) It seems to be more like a human who has read large parts of the internet but doesn't remember anything word for word. Learning from reading stuff was never considered a copyright violation.
replies(2): >>35394552 #>>35395849 #
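[Editor's note: a rough back-of-the-envelope check of the size claim in the comment above, using only the figures quoted there (1.4T tokens, 65B parameters, 16 bits each); exact tokenizer encoding and weight dtypes vary, so treat this as an illustration, not LLaMA's actual on-disk numbers.]

```python
# Size comparison between the quoted training corpus and model weights.
tokens = 1.4e12        # training tokens for the largest LLaMA model (as quoted)
params = 65e9          # model parameters (as quoted)
bytes_per_item = 2     # 16 bits = 2 bytes (assumption from the comment)

corpus_bytes = tokens * bytes_per_item   # ~2.8 TB of token data
model_bytes = params * bytes_per_item    # ~130 GB of weights

ratio = corpus_bytes / model_bytes
print(f"corpus ≈ {corpus_bytes / 1e12:.1f} TB, model ≈ {model_bytes / 1e9:.0f} GB")
print(f"corpus is ~{ratio:.0f}x larger than the model")  # ~22x
```

On these numbers the model is roughly 22 times smaller than its training data, which is the commenter's point: the weights cannot be storing the corpus verbatim.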
18. cubefox ◴[] No.35394204{3}[source]
On partly copyrighted text. Same as you and me.
19. shepardrtc ◴[] No.35394302{3}[source]
They don't ask permission when they're stealing users' data, so why should users ask permission for stealing their data?

https://www.usatoday.com/story/tech/2022/09/22/facebook-meta...

20. seydor ◴[] No.35394388[source]
It's an index of the web and our own comments, barely something they can claim ownership of, much less resell.

But OTOH, by preventing commercial use, they have sparked the creation of an open source ecosystem where people are building on top of it because it's fun, not because they want to build a moat to fill it with sweet VC $$$money.

It's great to see that ecosystem being built around it, and soon someone will train a fully open source model to replace LLaMA.

21. sva_ ◴[] No.35394431[source]
Thank you for putting your ass on the line and deciding to challenge $megacorp on their claims of owning the copyright on NN weights that have been trained on public (and probably, to some degree, also copyrighted) data. This seems to very much be uncharted territory in the legal space, so there are a lot of unknowns.

I don't consider it ethical to compress the corpus of human knowledge into some NN weights and then closing those weights behind proprietary doors, and I hope that legislators will see this similarly.

My only worry is that they'll get you on some technicality, like the fact that (some version of) your program used their servers, afaik.

22. Avicebron ◴[] No.35394552{4}[source]
> It seems to be more like a human who has read large parts of the internet, but doesn't remember anything word by word. Learning from reading stuff was never considered a copyright violation.

This is one of the most common talking points I see brought up, especially when defending things like AI "learning" from the style of artists and then being able to replicate that style. On the surface we can say, oh, it's similar to a human learning from an art style and replicating it. But that implies the program is functioning like a human mind, and as far as I know the jury is still out on that; I doubt we know exactly how a human mind actually "learns" (I'm not a neuroscientist).

Let's say for the sake of experiment I ask you to cut out every word of Pride and Prejudice and keep them all sorted. Then, when asked to write a story in the style of Jane Austen, you pull from that pile of snipped-out words and arrange them in a pattern that most resembles her writing. Did you transform it? Maybe; if a human did that, I bet they could even copyright it. But as a machine, it took those words and phrases and applied an algorithm to generating output. Even with stochastic elements, the direct backwards traceability, albeit through a 65B-parameter convolution of it, means that the essence of the copyrighted materials has been directly translated.

From what I can see we can't prove the human mind is strictly deterministic, but an AI very well might be in many senses. So the transference of non-deterministic material (the original) through a deterministic transform has to root back to the non-deterministic source (the human mind, and therefore the original copyright holder).

23. sheeshkebab ◴[] No.35394844{3}[source]
Public domain applies to uncopyrightable works, among other things (including previously copyrighted works). In this case the models are uncopyrightable, and I think FB (or any of these newfangled AI cos) would have an interesting time proving otherwise, if they ever try.

https://en.m.wikipedia.org/wiki/Public_domain

24. 8note ◴[] No.35395300[source]
I think the counter on those arguments is that LLM owners want to avoid arguing that the model is a derivative work of the training data.

If the LLM is a specific arrangement of the copyrighted works, it's very clearly a derivative work of them.

replies(1): >>35397215 #
25. ◴[] No.35395849{4}[source]
26. electricmonk ◴[] No.35397215{3}[source]
I was not suggesting that an LLM itself consists of an arrangement of the copyrighted works comprising the training data, but that the specific selection of the copyrighted works comprising the training data is part of what differentiates one LLM from another. A strained but useful analogy might be to think of the styles of painting an artist is trained in and/or exposed to prior to creating their own art. Obvious or subtle, the art style an artist has studied would likely impact the style they develop for themself.

However, to address your point about derivative works directly, the consensus among copyright law experts appears to be that whether a particular model output is infringing depends on the standard copyright infringement analysis (and that’s regardless of the minor and correctable issue represented by memorization/overfitting of duplicate data in training sets). Only in the most unserious legal complaint (the class action filed against Midjourney, Stability AI, etc.) is the argument being made that the models actually contain copies of the training data.

replies(1): >>35398908 #
27. sillysaurusx ◴[] No.35398908{4}[source]
I just want to say that I appreciate the legal analysis. Thanks for your time.

If you ever come up with more hypothetical arguments in favor of NNs being copyrightable, please let me know. Or post them somewhere.

28. terafo ◴[] No.35399778{4}[source]
I was sleepy, I meant to say that it WASN'T trained on data of Meta users.
29. __turbobrew__ ◴[] No.35407677[source]
Aggregating and organizing public knowledge is a fundamentally valuable action which many companies make their business off of.

If I create a website for tracking real estate trends in my area — which is public information — should I not be able to sell that information?

Similarly if a consulting company analyzes public market macro trends are they not allowed to sell that information?

Just because the information which is being aggregated and organized is public does not necessarily mean that the output product should be in the public.