Copyright doesn’t cover not liking LLMs

I’ve been thinking a lot about LLMs being trained on content against the will of the content creator. I am very aware of the damage that can be done here, especially to small creators who don’t have a legal budget, and I want to protect their rights, and their opportunity to make a living with their content. But I don’t think, in most cases, these content creators have a right to prohibit their work from being used to train LLMs.

For the sake of argument, there are a few things we’ll ignore. First, clear infringement. If an LLM writes a full-length Hunger Games sequel with the same characters, in the same universe, that is already covered by copyright and is clearly infringement. Important but intellectually boring. Second, the electricity needed to power the servers housing the LLMs. Also important, also boring from an intellectual property perspective.

Also, it’s not AI. I like “spicy autocomplete” but whatever you call it, it’s not “intelligence”. It’s simply making guesses based on all the content it has ingested. It can’t make new connections. This is GOOD – we’ve all seen Terminator and no one wants to live in that universe.

We will also assume that the content has been obtained legally. Unauthorized content is a problem but also uninteresting in this context. People getting content through unauthorized means was a problem before LLMs and will be a problem going forward, even if LLMs disappeared today.

So take an anecdote. Let’s say I am a huge fan of Stephen King. I can read all his books (even the ones my friend’s mom swore were written by his wife). This will surely influence my writing style. In fact it has, because I AM a fan of Stephen King and have read dozens of his books, and it would influence my fiction even more if I got around to writing any with any sort of frequency. This is clearly not any sort of copyright infringement. By the same reasoning, training your LLM on legally obtained copyrighted content is ALSO not copyright infringement.

Next, with my newly earned writing chops, I can write a 1,500 page sequel to The Stand. If I’m good enough, it will sound a bit like he wrote it. If I keep this on my laptop and only read it to pat myself on the back, this is completely legal and does not infringe on his copyright in any way.

Now I try to sell The Stand II – Standoff under my new pen name, Steven Kimg. This is VERY CLEARLY copyright infringement (and remains so even if I’m a bit more subtle with my marketing). Enforcement of these laws is hard, but it’s not impossible. I’m in favor of better enforcement of these laws to protect content creators, but that has little to do with LLMs. Ask any author how many infringing copies of their book were available on Amazon 3 years ago, before LLMs were mainstream.

What if my friend, who is ALSO a King superfan, pays me to write the book? He plans to keep it for himself and not show it to anyone else. For someone like Stephen King, this is too small to matter. He would probably be annoyed at me if he ever found out, but I can’t imagine he’d bother calling his lawyer. A small content creator might be angry, and justifiably so. I think this is also copyright infringement, but showing real damage would be difficult.

But what LLMs are doing is largely not the same as any of the above. They are reading all of Stephen King, and all of Suzanne Collins, all of Tumblr and Reddit, and anything else they can get their “hands” on. This is exactly what humans do to develop their own craft, and I don’t think the volume at which an LLM does this, compared to the volume at which a human does it, makes any difference to how the law applies. If I read a book and it influences my art, that is not copyright infringement. If I read 100 books and they influence my art, still not infringement. 1,000? Still no. 1,000,000? Still no, though this would be a difficult feat for a human.

The problem that isn’t well covered by existing law is when the artist doesn’t want their work used to train these LLMs. I don’t think that is a protected right. It’s like when a politician licenses a song from the label and plays it at a rally. The artist gets mad because they disagree with the politics. The politician may get bad publicity for this, but they are 100% within their legal rights to continue using the song (again, assuming it’s legally licensed, because if it’s not then it’s not interesting to discuss, it’s just boring infringement). Another example – the creators of The Boys have complained that many people who watch the show come away thinking Homelander is the hero. He is quite obviously a deranged sociopath, though I absolutely love the character. But this is a similar case of authorized users of your content using it for something you hate (promoting sociopathic superheroes).

If we want to prevent this, we need new laws. Copyright is a giant hammer, and modern content creation and sharing require a much more versatile tool. Creative Commons tried to provide this and it caught on in some circles but never got the critical mass from big companies, probably because they’re just fine with the giant hammer – they have the legal resources to back it up and don’t much care about the collateral damage. I’m not optimistic we’ll resolve this – the Venn diagram of those with the desire to change and the power to change is probably two separate circles. But maybe if we think about it this way, we can save some whining.

Is training your LLM on copyrighted material against the law?

LLMs are everywhere now, for better or worse (mostly worse). They are mostly used for creating just-good-enough garbage. This has been possible for some time now – back in 2015 I trained a Markov chain on a bunch of Wine Spectator reviews and had it spit out wine reviews with Amazon affiliate links, back when you could buy wine on Amazon. The site is still up and still makes me laugh. I suppose ChatGPT is a little better at this now, but not orders of magnitude better.
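For anyone curious what that kind of Markov chain looks like, here is a minimal sketch in Python. It is not the code behind the site mentioned above. The function names (build_chain, generate) and the sample reviews are invented for illustration, and a real version would want a lot more training text.

```python
import random
from collections import defaultdict


def build_chain(texts, order=1):
    """Map each run of `order` words to the words seen right after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, length=12, seed=None):
    """Walk the chain from a random starting state, one word at a time."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is repeatable
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Made-up sample reviews, purely for illustration.
sample_reviews = [
    "bold red with notes of cherry and a long smoky finish",
    "crisp white with notes of pear and a clean bright finish",
    "soft red with hints of plum and a short sweet finish",
]

print(generate(build_chain(sample_reviews), length=12, seed=2015))
```

All it does is pick the next word based on which words followed the current one in the training text, which is why the output tends to sound plausible without meaning much.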

A lot of things about LLMs are not interesting. They lie confidently. They’ve contributed to the utter uselessness of web search. They make it trivial to create content that search engines like but no one else has any use for. I find all that disturbing but not intellectually stimulating. What DOES interest me, however, is the copyright angle.

Assuming the LLM doesn’t spit out plagiarized work, I do not believe that training your LLM on copyrighted material is in any way a violation of any current US law. I am not a lawyer, and I am open to being proven wrong here, but it just doesn’t make sense. We are assuming that you are legally accessing these copyrighted works because the scenario is no longer interesting if you are, for example, downloading a torrent of every Random House book ever published and letting your LLM ingest all of that. Current law clearly covers that.

Let’s say I want to be an author. I’m into horror, so I read a bunch of horror books to hone my craft. I read a lot of Stephen King, and now a lot of what I write is pretty heavily influenced by his style. He’s a pretty successful author, so this isn’t really a bad thing. Now, no one would consider this copyright infringement, right? Every author ever is influenced by what they read. It’s one of the first things they teach you in writing class – go read more.

So what is the difference between me reading Stephen King and bits of his style creeping into mine, and an LLM reading EVERYONE, and bits of their style creeping into its writing? The only difference is volume. And there’s nothing in copyright law that says “doing this once is fine, doing it 100,000 times is a violation”.