One of the company's former employees – who left in August – considers these complaints justified, and he offers a form of demonstration backed by information theory. The central question: does ChatGPT training fall under the doctrine of fair use? That doctrine is assessed against four factors:
- Purpose and character of the use (in particular, whether it is of a commercial nature or serves a non-profit educational objective)
- Nature of the protected work
- Quantity and importance of the portion used in relation to the work as a whole
- Effect of this use on the value of the work or on its potential market
The ChatGPT effect: the Stack Overflow example
On the last point, it is difficult to comment directly, as ChatGPT's training data is not public. Studies have nevertheless attempted to measure the phenomenon. The ex-OpenAI employee cites one in particular, published in May 2024 and signed by three researchers from Boston University. It concludes that several Stack Overflow indicators declined after the release of ChatGPT, among them traffic, question volume and the number of new user accounts.
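To make this kind of measurement concrete, here is a minimal sketch of a before/after comparison of one indicator (weekly question volume) around ChatGPT's release on 30 November 2022. It is not the study's actual methodology, and the figures are synthetic placeholders.

```python
from datetime import date
from statistics import mean

def relative_change(weekly_counts: dict[date, float], cutoff: date) -> float:
    """Relative change in the mean of a weekly indicator after a cutoff date."""
    before = [v for d, v in weekly_counts.items() if d < cutoff]
    after = [v for d, v in weekly_counts.items() if d >= cutoff]
    return (mean(after) - mean(before)) / mean(before)

# Synthetic placeholder values (NOT the study's data): an index of weekly
# question volume in the weeks surrounding ChatGPT's release.
weekly_questions = {
    date(2022, 11, 7): 100.0,
    date(2022, 11, 14): 98.0,
    date(2022, 11, 21): 101.0,
    date(2022, 11, 28): 99.0,
    date(2022, 12, 5): 84.0,
    date(2022, 12, 12): 82.0,
    date(2022, 12, 19): 80.0,
    date(2022, 12, 26): 79.0,
}

print(f"Post-release change: {relative_change(weekly_questions, date(2022, 11, 30)):+.1%}")
```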
Concerning the purpose and character of the use, the question is essentially whether there is a substitution effect. This is how American courts reasoned in the 19th century, in the case that laid the foundations of the fair use doctrine. The defendant had copied fragments of a biography of George Washington to write his own version. The court decided that the use (even exhaustive) of a protected work could qualify as fair use as long as it served a “fair and reasonable” purpose, without the derivative replacing the original work.
Google Books and the substitution effect
In recent litigation, this factor has been examined through the lens of “transformativeness”. The Google Books case is an example: there, it was determined that the practices at issue (digitization of books, creation of a search feature and display of excerpts) served a “highly transformative” purpose and that the elements made public did not constitute a significant substitute for the protected aspects of the original work.
In 2023, the U.S. Supreme Court clarified the notion of “transformativeness”, calling for it to be weighed against the commercial nature of the use of the work.
ChatGPT is a commercial product, but does it serve a similar purpose to its training data? In practice, that purpose is hard to pin down for such a general-purpose product. We can nevertheless reframe the problem by asking whether ChatGPT has a direct substitution effect… or an indirect one, like a negative film review that hurts box-office admissions.
For the former OpenAI employee, there is indeed direct substitution. He gives the example of Stack Overflow and the drop in traffic correlated with the launch of ChatGPT: the answers are not necessarily similar, but they serve the same purpose.
Information theory applied to GenAI: a question of entropy
As for the “quantity and importance of the portion used” factor, two interpretations are possible. One from the angle of the inputs, which are copies of the original works… and which therefore cannot benefit from fair use. The other from the angle of the outputs, which, on the contrary, are not always copies of the original works.
In the context of generative AI, estimating H(Y) (the marginal entropy of the outputs over all possible training datasets) is practically impossible. Estimating H(X) (the entropy of the training dataset) is difficult, but not impossible. Given how generative models are trained, we can assume that H(Y) = H(X).
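As an illustration of why estimating H(X) is tractable in principle, here is a minimal sketch that computes the empirical Shannon entropy of a toy corpus under a unigram (token-frequency) approximation. A real estimate would have to model dependencies between tokens, which is precisely what makes H(X) hard to pin down; the whitespace tokenization is purely illustrative.

```python
from collections import Counter
from math import log2

def unigram_entropy_bits(tokens: list[str]) -> float:
    """Empirical Shannon entropy of the token distribution, in bits per token.

    A crude stand-in for H(X): it ignores dependencies between tokens and
    therefore overestimates the per-token entropy of real text.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy corpus standing in for a training dataset.
corpus = "the cat sat on the mat the dog sat on the rug".split()
print(f"H(X) under the unigram approximation: {unigram_entropy_bits(corpus):.2f} bits/token")
```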
Intuitively, low-entropy outputs are more likely to include data from the training dataset. In practice, developers tend to favor this type of output in order to limit hallucinations, among other things by repeating data during training and applying reinforcement learning.
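To illustrate the link between output entropy and concentration on the most strongly reinforced continuation, here is a minimal sketch. It uses temperature scaling as a simple stand-in for the sharpening effect described above (the mechanisms named in the text are data repetition and reinforcement learning, not temperature), and the logits are made up.

```python
from math import exp, log2

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Convert logits into a probability distribution at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Made-up next-token logits; the highest one stands for the continuation
# most strongly reinforced during training.
logits = [4.0, 2.5, 2.0, 1.0, 0.5]

for t in (1.5, 1.0, 0.5, 0.1):
    probs = softmax(logits, t)
    print(f"T={t:<4} entropy={entropy_bits(probs):.2f} bits  top-token p={max(probs):.2f}")
```

Lower temperatures yield lower-entropy distributions with most of the probability mass on the top token: the same qualitative effect as the training choices mentioned above.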