
452 points croes | 4 comments
Workaccount2 ◴[] No.43963737[source]
I have yet to see someone explain in detail how transformer model training works (showing they understand the technical nitty-gritty and the overall architecture of transformers) and also lay out a case for why it is clearly a violation of copyright.

You can find lots of people talking about training, and you can find lots (way more) of people talking about AI training being a violation of copyright, but you can't find anyone talking about both.

Edit: Let me just clarify that I am talking about training, not inference (output).

replies(10): >>43963777 #>>43963792 #>>43963801 #>>43963816 #>>43963830 #>>43963874 #>>43963886 #>>43963955 #>>43964102 #>>43965360 #
1. gitremote ◴[] No.43963955[source]
They never said model training is a violation of copyright. The ruling says model training on copyrighted material for analysis and research is NOT copyright infringement, but the commercial use of the resulting model is:

"When a model is deployed for purposes such as analysis or research… the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries."

replies(1): >>43964872 #
2. Workaccount2 ◴[] No.43964872[source]
The vast trove of copyrighted work has to refer to training. ChatGPT is likely on the order of 5-10TB in size. (Yes, terabytes.)

There are college kids with bigger "copyright collections" than that...

replies(1): >>43965192 #
3. gitremote ◴[] No.43965192[source]
No. The paragraph as a whole refers to the "outputs" of vast troves of copyrighted work.

Disk size is irrelevant. If you lossy-compress a copyrighted bitmap image to a small JPEG image and then sell the JPEG image, it's still copyright infringement.

replies(1): >>43969201 #
4. nickpsecurity ◴[] No.43969201{3}[source]
I wouldn't say it's irrelevant. How much of a work you use is one of the fair use considerations. Their huge collections of copyrighted works make them look worse in legal analyses.