I asked ChatGPT to articulate the canonical case defending the inclusion of copyrighted works in AI training sets on the grounds of Fair Use. Whether or not you agree, here’s what it came up with:

The application of copyrighted materials in AI training data constitutes fair use. AI training doesn’t aim to duplicate copyrighted works but to extract and learn patterns from vast amounts of information (Bracha & Pasquale, 2021). The nature of AI learning, therefore, falls within the doctrine of transformative use, where new meaning or message is created, and the original work isn’t merely copied (Cohen v. Paramount Pictures Corp., 845 F.2d 851 (9th Cir. 1988)).

Furthermore, the impact on the potential market for the original work, another factor considered under fair use (Harper & Row Publishers, Inc. v. Nation Enters., 471 U.S. 539 (1985)), is negligible. AI doesn’t replicate or distribute the work, so the market for the original work remains unaffected.

Another fair use factor, the amount and substantiality of the portion used (Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984)), also lends support. AI doesn’t depend on the full meaning or unique creativity of individual works, but rather on a mass of data for pattern recognition.

In sum, copyrighted material’s use in AI training data can be justified under fair use, as the use is transformative, doesn’t harm the original market, and doesn’t depend on the unique creativity or meaning of individual works.

Please note that this argument represents a legal theory and not a definitive interpretation of the law.

(Source for case citations: U.S. Copyright Office, Fair Use Index, https://copyright.gov/fair-use/)

I have not tried to tease out the sources it is allegedly citing, but the reason they are embedded in this response is that I asked, in my account-level custom instructions, that it always cite sources or indicate where none exist. It’s unreliable, though, so take it with a grain of salt. I’m not sure whether citing potentially incorrect or invented sources is actually better than citing none at all.

Regardless of the alleged sources cited, I think the basic rationale sketched out here is probably sound. To summarize in my own words, it hinges on:

  • The aim is not to reproduce individual works in the training set (and getting the system to do so is actually quite hard – though easier in cases where a given work is over-represented in training, like the Mona Lisa)
  • The process is not “sampling” (essentially copying & pasting specific elements) from originals in training to make a hodge-podge final result composed of elements clipped from other pieces.
  • Instead, the training consists of taking a mass of data and measuring dimensions (characteristics) within and across a vast number of sources (see the toy sketch after this list).
  • Since those sources are not reproduced verbatim, but rather radically transformed in new and original ways through prompting, the use is not derivative but transformative.
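
To make that “measuring dimensions in aggregate” point a little more concrete, here’s a deliberately toy sketch in Python – my own illustration, not how any production model actually works – that pools one statistic (word co-occurrence) across several sources into a single shared table. Nothing in the resulting table maps back to any one source:

```python
from collections import Counter
from itertools import combinations

# A toy "training set": a handful of pooled sources.
sources = [
    "the old man and the sea",
    "the sun also rises over the hills",
    "a farewell to arms and to the old ways",
]

# The only thing "learned" here is one shared statistic: how often pairs of
# words co-occur within a source. No source text is stored anywhere.
cooccurrence = Counter()
for text in sources:
    words = sorted(set(text.split()))
    for pair in combinations(words, 2):
        cooccurrence[pair] += 1

# The resulting table blends every source together; you can't point at an
# entry and say which text it "came from."
print(cooccurrence.most_common(5))
```

A real model does something vastly more elaborate (billions of learned parameters rather than a little counting table), but the shape of the claim is the same: what gets stored is a blend of measurements, not copies.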

I think there’s some debate around whether or not this use via AI hurts the market for the originals, but it seems clear that while I can prompt “in the style of Hemingway” or “in the style of Harry Potter,” I can’t actually use the system to reproduce one of those original works in full.

I asked ChatGPT to address this issue, since it comes up so commonly in rhetoric around these questions, and it wrote:

The system’s ability to generate content in the style of specific authors or artists isn’t a result of training on those individuals’ works specifically, but rather a byproduct of learning from a broad dataset and developing an understanding of various stylistic elements. When requested to generate content in a specific style, the system isn’t replicating any single work or infringing upon the moral rights of a creator. Instead, it’s using learned stylistic patterns and general knowledge of an author’s or artist’s style.

It’s also important to note that AI generation capabilities are an amalgamation of many influences, rather than the result of copying individual sources. The application of style in these cases can be seen more as an abstraction and generalization rather than a direct derivation from specific works (Hildebrandt, Mireille. “AI and Copyright: Authorship, Originality and Infringement.” (2021)).

Also, from a legal standpoint, it’s notable that copyright law doesn’t protect style. Copyright protects the expression of ideas, not the ideas themselves (Baker v. Selden, 101 U.S. 99 (1879)). This means that while the exact text of a Hemingway novel is protected by copyright, the general style in which Hemingway wrote is not.

(Source for information about copyright of ideas vs expression: U.S. Copyright Office, Copyright Basics, https://copyright.gov/circs/circ01.pdf)

Again, take those cited sources with a grain of salt (and investigate them).

This also reminds me of a thread I saw pop up in response to my Newsweek piece, where one user on a forum did what I thought was a fantastic job of breaking down the Fair Use defense:

“Fair use” legal doctrine permits limited use of copyrighted material without explicit permission from the copyright holder. Fair use often applies in cases of “transformative use” of materials. Training AI models on copyrighted materials can be considered transformative use – both the generated content and the representation of the content within the model itself is significantly different from the copyrighted material, often (in fact, almost always) incorporating content that wasn’t present in the copyrighted material. As long as the AI model isn’t used to directly replicate or distribute copyrighted material, it can fall within the boundaries of fair use.

I realize everything I’ve written above is debatable. That’s my point: it’s a plausible argument, and as long as OpenAI, Google, and other companies haven’t been legally forced to scrub their training data, the argument is still open.

While fair use is a legal doctrine, I think it also serves as a decent ethical principle, in that it seeks to strike a balance between the rights of copyright holders and the broader public interest, promotes creativity and innovation, fosters information exchange, and demands that ethical (and legal) decisions take context into account rather than declaring a practice uniformly good or bad. There are certainly other ethical principles that can and should come into play, but I believe fair use is a decent starting point, and it certainly shouldn’t be overridden without a full consideration of the associated costs and benefits.

As for your hypothetical: yes, I would definitely be ok with the guy in the Newsweek story, or anyone else, writing and selling short stories that used an LLM where my published materials were part of the training set. Best of luck to him. I honestly can’t think of a single reason why I wouldn’t be ok with that, because he’s not lifting my words or passing them off as his own. That’s simply not how the tool works.

Nothing is “trained on my materials”. The model is trained on a huge corpus of text, and my stuff isn’t distinguished within that corpus. The model doesn’t even know what words are, let alone which stuff is mine. It breaks all materials from all sources into tokens, which you could roughly think of as grammatically-meaningful pieces of words, and learns relationships between tokens. The tokens from my work exist only as arrays of numbers that can be used to measure statistical distances to other tokens, and there is no set of numbers that is only derived from my work – my stuff is stored only in aggregate with other works, the grand majority of which are not copyrighted (because the grand majority of the scrape-able internet is not copyrighted). I don’t think it’s plausible to construe any of that as stealing.
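
To make the commenter’s point about tokens concrete, here’s a minimal sketch using OpenAI’s tiktoken library (assuming you have it installed; any of its published encodings would do). The point is simply that the model’s view of a text is a list of integer IDs – no words, no authors, no filenames – which are then turned into learned numerical vectors during training:

```python
# pip install tiktoken
import tiktoken

# One of OpenAI's published tokenizers; the specific encoding doesn't matter here.
enc = tiktoken.get_encoding("cl100k_base")

text = "The sun also rises."
token_ids = enc.encode(text)

print(token_ids)                                 # a short list of integers (the token IDs)
print([enc.decode([tid]) for tid in token_ids])  # the word fragments each ID maps back to
```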

While I happen to agree for the most part with this person’s argument, and it is basically supported by ChatGPT’s defense of Fair Use as well, I recognize that these kinds of technically grounded arguments are not emotionally satisfying to those who feel that AI training and the use of generative models is somehow extractive of what they perceive as the value they create in their work.

It also ignores the plain fact that there *is* fishy data laundering going on, along with the use of non-profit or educational institutions as a shield against liability, which Waxy.org’s excellent post on the topic goes into in more detail.

It’s also worth throwing into the mix the fact that even though Adobe made a big noise about its generative Firefly system being fully licensed, a subset of affected creators say they never agreed to their work being used for AI training and generation.

That said, there are a number of arguments in that camp that I think don’t completely add up, like this one quoted from VentureBeat:

“They’re using our IP to create content that will compete with us in the marketplace,” he said. “Even though they may legally be able to do that, because we all signed the terms of service, I don’t think it is either ethical or fair.” 

If it’s not illegal, what is the basis for the claim that it is not ethical or not fair?

I did ask ChatGPT about this topic of “moral rights,” which to me is still a bit vague. It responded, in part (check the sources it cites – I have not):

The assertion of moral rights over work involves creators’ interests in being recognized as authors and in preserving the integrity of their work (Berne Convention for the Protection of Literary and Artistic Works). However, in the context of AI training data, this may be counterbalanced by several considerations.

From a legal standpoint, U.S. law recognizes a limited version of moral rights through the Visual Artists Rights Act of 1990 (VARA). However, VARA applies only to visual art and doesn’t cover all types of copyrighted works. Thus, under current U.S. law, the broad assertion of moral rights over works used in AI training data may lack legal backing (U.S. Copyright Office, Visual Artists Rights Act).

Again, I don’t know how accurate those statements are. But then it launched into this, which is a common defense, and I think a largely bullshit one:

Ethically and morally, there’s a utilitarian perspective that the societal benefits of AI advancements can outweigh individual claims of moral rights. AI leverages vast amounts of data for significant purposes like medical research, climate modeling, and technological innovation. Restricting the use of copyrighted work in AI training could hinder such progress.

I’m in the camp of who gives a shit what’s ‘convenient’ for companies? Potentially “hindering” progress is not a valid excuse for not bothering to find a fair and equitable answer. We don’t all need to bow down to the market as being the highest value that governs our lives.

Personally, I don’t see why AI companies can’t get full permission from creators to include their work in training data. And it should be opt-in, not opt-out. And if you opt-in, there ought to be some way to track and be compensated for the use.

But the specifics of *how* all that might work are, I think, absurdly complicated. If we look at Spotify as an example (which itself is not that fair or equitable to artists, imo), the minimum payment per stream is said to be something like $0.003. And that’s for whatever constitutes fully playing a track. That’s not comparable to how a generative AI licensing system might work, where, say, only an infinitesimally small portion of any single given work might be referenced statistically, based on its detected internal characteristics. If we were to peg a number to it, we might charitably say something like: okay, 1/1000th of your source work is being referenced (in actuality, it’s probably far less). So what should the payment be to you? Something like $0.000003 per use? Let’s be realistic and say… it would probably be a “lot” lower. And anyway, what constitutes a use? Any generation that references an area of the latent space of a model you helped train? If your source image was 1 out of 1,000,000 hamburger images in the training set, exactly what percentage of a given generation is owed to you?
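
Just to spell out how quickly that back-of-envelope math vanishes into rounding error, here is the arithmetic from the paragraph above written out. The dollar figures and fractions are the same hypothetical guesses, not a proposal:

```python
# Back-of-envelope math using the hypothetical numbers from the paragraph above.
# None of these figures are real; they're the "charitable" guesses spelled out.
spotify_min_per_stream = 0.003            # dollars, the quoted per-stream minimum
fraction_of_work_referenced = 1 / 1000    # the charitable 1/1000th guess

payment_per_use = spotify_min_per_stream * fraction_of_work_referenced
print(f"${payment_per_use:.6f} per use")  # $0.000003

# And if your photo is 1 of 1,000,000 hamburger images in the training set,
# an equal split of even a whole $1.00 "hamburger royalty" per generation is:
print(f"${1.00 / 1_000_000:.6f} per image, per generation")  # $0.000001
```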

Maybe there’s a formula in some cases which would be relatively more clear-cut than that, but given that each source image may represent hundreds or even thousands (or more) of measured dimensions, which are then mashed together with hundreds or thousands of other images that are somehow contiguous along those dimensions… well, let’s just say it’s really fucking complex to figure out. Maybe there’s a way, but just as a licensing platform like Spotify ends up being not very fair to creators, I wouldn’t hesitate to guess that industry will create a scheme that is equally shitty here, and then say that they have figured out compensation. When really, all we will have done is recreate the power imbalances that are already accepted as “normal.”

I don’t say that to say we shouldn’t try. We probably should. We need to figure out a better method.

I do write all this, however, to show that it is objectively *not* a clear-cut case as to 1) whether wrong has been done under the law (courts will be the ones to figure this out, not online opinion pieces), and 2) whether it’s possible to build a system which would make all of this make sense and create a good deal for everybody. I would guess that a better way is possible, but I’m not holding my breath that it will be easy, or that creators will end up being the ones in charge of it. Sad, but probably true, if history is any indication.