We can already have different licenses for compiled binaries vs the source. Also the output of ML seems to belong to whoever pressed the generate button atm.
Not really. The reason software can be copyrighted at all is that the actual code (and the resulting object code) is creative. Courts have called this protectable element the "structure, sequence and organization" (SSO) of the work. ML models don't exhibit any creative SSO the way actual code does.
> Also the output of ML seems to belong to whoever pressed the generate button atm.
The output, it seems to me, is uncopyrightable. Copyright only cares about who provides the creativity for the work at issue, not who put in the effort to make it happen. You may own the copyright to your prompt, but the result is generated entirely by the AI and thus lacks human authorship.
It's possible that the automated processing of the dataset is considered to be non-creative enough that the generated AI model cannot be copyrighted. The code to train the model and the input dataset (and the works therein) definitely can be, but not the model itself.
In that case, Facebook would be out of luck, as long as the code to train the model isn't shared. If the courts find AI models to be a different type of work that does produce copyrightable artifacts, Facebook may follow in the footsteps of other copyright giants and start filing lawsuits against anyone they can catch. I very much doubt they'd go that far, especially since by the time they could even start a lawsuit confidently, the leaked model would probably already be outdated and irrelevant.
Personally, I expect the model to end up being uncopyrightable, as would be the output of the model.
This may or may not have very interesting results. The dataset itself is probably copyrightable (a human or set of humans composed it, unless that was also done completely automatically), but if that copyright is claimed, the individual rights holders of the included works may demand a licensing fee, similar to how sampling works in music: "you want to use my work, pay me a fee".
Or maybe the dataset is considered diverse enough that individual rights holders can't expect compensation for their works' inclusion, and you can get around copyright law by amassing enough content at once. Who knows.
Not too familiar with the drama, but I believe what happened was that someone with access leaked the torrent used to download the weights. In a legal sense this would be similar to someone, say, leaking a Google Drive link containing proprietary information that was only intended to be shared with vendors.
https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...
There aren't any confidentiality terms.
So far the rulings in the US, at least, do not support this.
https://arstechnica.com/information-technology/2023/02/us-co...
In this case, it was images generated via Midjourney and not the output of an LLM, but my layman's understanding is that the result would be equally applicable to LLM output. Effectively, the Copyright Office does not consider putting in a prompt enough for there to be "human authorship" of the work. In this specific case, that resulted in the images in the comic being considered uncopyrightable. The broader comic — the organization of the images, the plot and dialogue, etc. — still enjoys copyright protection. But in the US, I could just directly take the images in the comic that Midjourney produced and use them for another purpose without violating copyright.