In a preview of its forthcoming report on copyright and artificial intelligence, the U.S. Copyright Office has unveiled a pre-publication draft of the report’s section on generative-AI training. The draft reflects a concerning tendency toward uncertainty and overreach, giving short shrift to the substantial arguments in favor of AI developers and deployers and notably discounting significant public benefits, while broadly construing market harms in ways that risk stifling technological progress.
Given both the office’s institutional incentives and its historical focus, this is perhaps unsurprising. It’s also important to note that, amid the office’s recent political struggles, the draft may yet be withdrawn, significantly modified before final publication, or simply not finalized at all.
With that said, the views the office expresses do echo some recent court findings. The draft report also walks through each of the fair-use factors in considerable detail, providing nuanced and sometimes controversial interpretations. This post highlights some of the points that stood out as particularly significant.
Maybe GenAI Is Transformative…Sometimes?
The report adopts a fragmented analytical approach, suggesting that each distinct stage of generative AI—from initial training to fine-tuning to retrieval-augmented generation (RAG) to deployment—requires its own fair-use analysis for each instance (e.g., each round of training and fine-tuning for each release of ChatGPT, as well as each deployment of ChatGPT as a unique product). Specifically, the report notes that training AI on copyrighted material will “often” be transformative, but then offers important caveats:
[T]ransformativeness is a matter of degree, and how transformative or justified a use is will depend on the functionality of the model and how it is deployed. On one end of the spectrum, training a model is most transformative when the purpose is to deploy it for research, or in a closed system that constrains it to a non-substitutive task….
On the other end of the spectrum is training a model to generate outputs that are substantially similar to copyrighted works in the dataset. For example, a foundation image model might be further trained on images from a popular animated series and deployed to generate images of characters from that series. Unlike cases where copying computer programs to access their functional elements was necessary to create new, interoperable works, using images or sound recordings to train a model that generates similar expressive outputs does not merely remove a technical barrier to productive competition. In such cases, unless the original work itself is being targeted for comment or parody, it is hard to see the use as transformative.
The report nods to the recent “Studio Ghibli” AI trend and suggests that such uses are not transformative. But note how the report characterizes the use. The second part of the quoted passage begins by noting that training a model to generate outputs “substantially similar” to existing works isn’t transformative. This is an obvious restatement of copyright law: if you make a product whose purpose is to infringe copyright, the manner in which it infringes can’t be transformative.
But the report makes a subtle move that dramatically expands what counts as “substantially similar.” It argues that “to train a model that generates similar expressive outputs does not merely remove a technical barrier to productive competition.” Which is to say, as in the recent “Studio Ghibli” craze, making models capable of generating “similar expressive outputs” won’t be transformative. But this phrasing suggests a much larger analytical frame—something closer to an artist’s “style” or a writer’s “voice,” and not so much concern over outputs that are “substantially similar” to particular fixed works. Indeed, in its footnote to this language, the report says:
The decision to train on expressive works when there are available alternatives may itself reflect a lack of transformative purpose. For example, an image model could be trained on mass image data collected through automated means (street-view cars, body cameras, security cameras), yet developers often choose aesthetic images such as stock photography. This suggests the purpose is not simply to generate images of the physical world, but to generate images that have expressive qualities like the originals.
The report is concerned with AI models trained on images for their aesthetic content because the models learn to create works with those aesthetics. Again, the footnote doesn’t claim the concern is the reproduction of particular expressive works but instead the creation of new works that “have expressive qualities like the originals.”
How far does this logic extend? And does it extend to human works, as well? The Black Crowes certainly borrowed quite a lot of the “expressive qualities” of the Rolling Stones, and Stephen King has freely admitted to wearing influences like H.P. Lovecraft on his sleeve. As Picasso is purported to have said: “good artists borrow, great artists steal.”
Copyright protection does not extend to so-called “unprotectable elements,” such as facts, ideas, concepts, processes, systems, methods, and general themes common to a wide variety of works. Indeed, the report’s language would appear to undermine the scènes à faire defense to copyright infringement, under which stock elements that flow naturally from a genre or setting (the hard-boiled detective, the haunted house) receive no protection.
While there’s room for nuanced distinctions between clearly transformative uses (such as foundational training for new kinds of outputs or research-oriented models) and less transformative applications (such as generating similar expressive content), this level of fragmentation for determining fair use is impractical. It injects substantial uncertainty into AI development. A clearer delineation between foundational training (where copying vast datasets can be transformative) and deployment-level uses designed to infringe might yield more predictable and economically beneficial outcomes.
The ‘Effect on the Market’ Swallows Everything
The Copyright Office’s most troubling analytical leap involves its expansive view of the fourth fair-use factor: the “effect on the market.” By interpreting potential AI-generated outputs as direct substitutes for original works, the report threatens to substantially restrict innovation.
According to the report:
While we acknowledge this is uncharted territory, in the Office’s view, the fourth factor should not be read so narrowly. The statute on its face encompasses any “effect” upon the potential market. The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data. That means more competition for sales of an author’s works and more difficulty for audiences in finding them. If thousands of AI-generated romance novels are put on the market, fewer of the human-authored romance novels that the AI was trained on are likely to be sold. Royalty pools can also be diluted. UMG noted that “[a]s AI-generated music becomes increasingly easy to create, it saturates this already dense marketplace, competing unfairly with genuine human artistry, distorting digital platform algorithms and driving ‘cheap content oversupply’ – generic content diluting human creators’ royalties.”
The concern expressed here is not even that a model’s given output will affect the market for some particular piece of content by creating a substantially similar copy. The concern is that the market for creative works as a whole will be affected by generative AI. The report at least acknowledges the novelty of this position.
Further, this framing misconstrues the nature of generative-AI outputs, which typically serve different consumer needs and interests from the original expressive works. AI-generated writing, for instance, can support entirely different uses and markets than original journalistic or literary creations. For example, using generative AI to assist in creating original news content, by definition, cannot infringe on an original expression, as the intended output (a news story on an event that just occurred) cannot possibly exist yet. The report essentially construes the “effect on the market” as an argument about labor inputs in a production process, and not a matter of whether an original expression has been infringed.
The report imagines a certain state of the market and wants to freeze it in place. While it’s true that generative AI can be used to hastily create new romance novels or pop songs, it’s also true that AI will help with curation, allowing users to locate higher-quality works in less time. This, however, is left unsaid in the report, and probably for good reason: predicting all of the complicated market opportunities and responses that AI will create is a task far beyond the scope of copyright analysis. These are valid social concerns that deserve public discussion, but it is not copyright law’s job to balance those sorts of big-picture equities.
Model Weights as Infringement
The report’s position on model weights as potentially infringing copies is similarly problematic. Equating AI models’ learned patterns with mere copies oversimplifies the sophisticated statistical and transformative processes at play, akin to conflating human memory or learned skills with unlawful reproduction. Such a stance stretches copyright protection beyond reasonable limits and misunderstands the underlying technology. As the report itself acknowledges, “the extent to which models retain or ‘memorize’ training data…was disputed by commenters.” But the report nonetheless suggests that, “where the learned pattern is highly specific, ‘the pattern is the memorized training data.’”
This broad interpretation poses a substantial risk to AI innovation, as it fundamentally misunderstands how generative-AI models operate. Model weights do not function as databases or forms of digital compression that store and reconstruct specific copyrighted works. Rather, these weights reflect complex statistical relationships regarding how likely certain tokens—such as fragments of words or phrases—are to appear together, based on vast patterns extracted from human-generated content.
Moreover, equating model weights with databases or compressed copies conflates correlation with reconstruction. The trained models capture statistical probabilities about token arrangements, revealing patterns of usage and structure prevalent in large bodies of text. When generative models produce outputs, they synthesize text based on these statistical insights, rather than reconstructing original expressions from specific sources. Characterizing these models as repositories of copied works therefore overlooks the transformative, statistical, and probabilistic nature of their operation, and could impose inappropriate legal constraints on technological development and innovation.
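To make the distinction concrete, here is a deliberately toy sketch of our own (an illustrative bigram model, not how any production system is actually built): “training” stores only a table of next-token frequencies, and generation samples new sequences from that table. Real models learn dense neural weights rather than literal count tables, but the analogy captures the point that the trained artifact records statistical relationships among tokens rather than retrievable passages.

```python
import random
from collections import defaultdict

def train(tokens):
    """Tally how often each token follows each other token.

    The resulting 'weights' are a table of co-occurrence counts;
    the training text itself is not stored.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=10):
    """Sample a new sequence token by token from the learned frequencies."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train(corpus)
print(generate(model, "the"))  # e.g., "the dog sat on the mat and the cat ..."
```

Note that even in this toy model, a phrase appearing only once in the training text will tend to be reproduced verbatim, because only one continuation was ever observed; that is the rough analogue of the “memorization” the report invokes. But the possibility of such an output is distinct from the claim that the weight table itself is a copy of the text.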
It is true that generative-AI models can sometimes produce outputs that closely resemble existing works. But these outputs are fundamentally new generations, rather than copies stored within the model. While certain outputs might occasionally implicate specific copyrights, the underlying model weights themselves are neither reproductions nor infringements of any protected work.
The Public Benefits of AI
The report’s treatment of the potential public benefits of generative AI is inadequate. While it nods perfunctorily to their existence, it quickly pivots to counterbalance that acknowledgment:
In the Office’s view, there are strong claims to public benefits on both sides. Many applications of generative AI promise great benefits for the public, as does the production of expressive works. While the sheer volume of production itself does not necessarily serve copyright’s goals, commenters identified a wide range of potential benefits weighing in favor and against training on unlicensed copyrighted works. With regard to the fair use analysis, however, the Office cannot conclude that unlicensed use of copyrighted works for training offers copyright-related benefits that would change the fair use balance, apart from those already considered.
The fact of the matter is that training these large language models and other forms of generative AI is not directed solely toward writing poems or creating pictures. The iterative process of training and deployment allows researchers to discover both how to develop these systems at scale and which uses may generate the most social benefit. We can easily imagine how training AI at scale on medical images and photos of patients might help these systems recognize early warning signs of various diseases. What we can’t imagine is what happens when users and entrepreneurs begin to employ these systems in unexpected ways. The public benefit of facilitating training is hard to overstate.
It is, of course, also true that we do not want to hollow out the creative industries. But as the report itself notes (and the Copyright Office has previously recognized), the scale of licensed works that would be needed to facilitate AI training means the economics just don’t make sense. Instead, in what is likely a question for Congress, we may need to think about ways to enable creators to better leverage their unique styles for different sorts of model outputs. For example, general models fine-tuned to generate works in the style of (or with the voice of) particular creators could provide unique monetization opportunities. And that may require a new form of property right related to (but distinct from) copyright.
Conclusion
The report’s posture echoes a worrying trend evident in recent cases like Thomson Reuters v. Ross Intelligence and ongoing speculation around Kadrey v. Meta, both of which point toward a rather restrictive interpretation of fair use. If courts uniformly embrace this narrow approach, the resulting landscape could significantly hinder AI innovation. In that event, legislative intervention or appellate-court clarification may become necessary to recalibrate the balance.
Ultimately, while some level of concern about protecting rights holders is justified, the Copyright Office’s report introduces too much uncertainty and recommends overly broad restrictions. Policymakers and appellate courts should reaffirm the need for flexibility where crucial innovation is at stake. Ensuring a balanced fair-use doctrine requires recognizing genuine differences between generative-AI tools and original expressive works in terms of their output, intent, and market impact. Without this balance, the legal framework risks becoming an obstacle to, rather than an enabler of, technological advancement and economic growth.