Is training your LLM on copyrighted material against the law?

LLMs are everywhere now, for better or worse (mostly worse). They are largely used for creating just-good-enough garbage. This has been possible for some time now – back in 2015 I trained a Markov chain on a bunch of Wine Spectator reviews and had it spit out wine reviews with Amazon affiliate links, back when you could buy wine on Amazon. The site is still up and still makes me laugh. I suppose ChatGPT is a little better at this now, but not orders of magnitude better.

A lot of things about LLMs are not interesting. They lie confidently. They’ve contributed to the utter uselessness of web search. They make it trivial to create content that search engines like but no one else has any use for. I find all that disturbing but not intellectually stimulating. What DOES interest me, however, is the copyright angle.

Assuming the LLM doesn’t spit out plagiarized work, I do not believe that training your LLM on copyrighted material is in any way a violation of any current US law. I am not a lawyer, and I am open to being proven wrong here, but treating it as infringement just doesn’t make sense to me. I am also assuming that you are accessing these copyrighted works legally, because the scenario is no longer interesting if you are, for example, downloading a torrent of every Random House book ever published and letting your LLM ingest all of that. Current law clearly covers that.

Let’s say I want to be an author. I’m into horror, so I read a bunch of horror books to hone my craft. I read a lot of Stephen King, and now a lot of what I write is pretty heavily influenced by his style. He’s a pretty successful author, so this isn’t really a bad thing. Now, no one would consider this copyright infringement, right? Every author ever is influenced by what they read. It’s one of the first things they teach you in writing class – go read more.

So what is the difference between me reading Stephen King and bits of his style creeping into mine, and an LLM reading EVERYONE, and bits of their style creeping into its writing? The only difference is volume. And there’s nothing in copyright law that says “doing this once is fine, doing it 100,000 times is a violation”.