Derivative Work Status of Large Language Models

Determine whether large language models trained on text datasets constitute derivative works of their training data, and clarify the legislation and case law governing this classification. This determination directly impacts the permissibility of using various Creative Commons–licensed materials for model pre-training.

Background

Within the Law Perspective, the authors adopt a conservative stance on licensing because model weights may transform or memorize training content. They note that if LLMs are considered derivative works, materials under the NoDerivatives (ND), NonCommercial (NC), or ShareAlike (SA) Creative Commons licenses may be unsuitable for commercial model training.

This uncertainty motivates excluding certain licenses from the corpus and highlights the need for legal clarity, since the derivative-work determination governs what content can lawfully be included in pre-training datasets.

References

The question of whether LLMs constitute a derivative work (a transformed version) of their training dataset is as yet unresolved, and legislation and case law are currently unclear.

GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training  (2604.00920 - Oort et al., 1 Apr 2026) in Section 3.2 (The Law Perspective), paragraph “LLM as derivative work”