Canada’s largest media companies, including the Globe and Mail, Toronto Star, Postmedia, CBC, and Canadian Press, came together last week to file a copyright infringement lawsuit against OpenAI, the company behind ChatGPT. The lawsuit is the first high-profile Canadian claim lodged against the enormously popular AI service, though similar suits have been filed elsewhere, notably including a New York Times lawsuit launched last year. While the lawsuit itself isn’t a huge surprise, the relatively weak, narrow scope of the claims, discussed below, is. Unlike comparable lawsuits, the Canadian media companies’ claim is largely limited to data scraping, which may be the weakest copyright claim. Moreover, the companies say they have no actual knowledge of when, where, or how their data was accessed, an acknowledgement that doesn’t inspire confidence given that there is evidence available if you know where to look.
So why file this lawsuit? The claim is sprinkled with the most obvious reason: the Canadian media companies want a settlement that involves OpenAI paying licence fees for the inclusion of their content in its large language models, and the lawsuit is designed to kickstart negotiations. The companies aren’t hiding the ball, as there are repeated references along the lines of “at all times, Open AI was and is well aware of its obligations to obtain a valid licence to use the Works. It has already entered into licensing agreements with several content creators, including other news media organizations.” The takeaway is that Canadian media companies want to license their content too, much like the licensing agreements with global media companies such as News Corp, Financial Times, Hearst, Axel Springer, Le Monde, and the Associated Press.
The push for licences may yet succeed, but this lawsuit on its own isn’t likely to ratchet up significant pressure. First, the Canadian claims are much narrower than those found in other lawsuits such as the NY Times case. Unlike the NY Times, which focused on both the inputs (the materials used to train ChatGPT) and the outputs (allegations that ChatGPT occasionally provides copyright-infringing results), the Canadian claim only targets the inputs, with no allegation that ChatGPT results are infringing. Given the fair use protections afforded in the Google Books case, many believe that the input claims are the weakest part of the NY Times case, with much more attention focused on the outputs that are claimed to produce actual matching text. There is plenty of debate about those claims too, but the Canadian media companies don’t even try to make them. They don’t claim that ChatGPT is producing results that infringe their work, only that the scraping of their work for inclusion in large language models without a licence infringes their copyright. Further, unlike the NY Times, they also don’t sue Microsoft, a major investor in and user of ChatGPT, which also suggests that a licence from OpenAI is the real goal.
How Important Is Canadian News Content for AI Systems?
The claims as alleged will face some significant headwinds. For one thing, the Canadian companies admit that they don’t actually know how much of their work is being used. Instead, they point to how much they have produced or licensed, with the assumption that all of it was scraped by OpenAI. The New York Times did some real digging into the use of its materials for AI training systems, whereas the Canadian companies don’t seem to know very much about what is actually taking place. This isn’t me speculating. The filing literally says “the full particulars of when, from where, and exactly how, the Works were accessed, scraped, and/or copied is within the knowledge of OpenAI and not the News Media Companies.”
While it is true that some of the specifics are difficult to discern, there is considerable publicly available information on how much was used during parts of the claim period, which starts in 2015. For example, we know that Common Crawl was by far the largest source of tokens used to train GPT-3. This comes directly from OpenAI’s own scientists, who published on the issue in 2020. OpenAI created a filtered version of Common Crawl, but it didn’t actually scrape that data itself. Instead, Common Crawl, a non-profit started in 2007, did. The works from the NY Times amounted to 100 million tokens in that data set, which sounds like a lot but is actually a tiny fraction of the total. You can search which URLs were used, so there is room to know what was included, but the Canadian media companies seemingly didn’t bother doing so.
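The URL searching described above is not exotic: Common Crawl exposes a public CDX index that can be queried per crawl for every capture under a domain. The sketch below shows how such a query could be built and its response parsed; the index host and parameters are real, but the crawl ID and domain are illustrative placeholders.

```python
# Sketch of querying Common Crawl's public CDX index for a domain's captures.
# The endpoint (index.commoncrawl.org) and parameters are real; the crawl ID
# passed in by a caller is an assumption -- pick one from commoncrawl.org.
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def build_index_query(crawl_id: str, domain: str) -> str:
    """Build a CDX index query URL for all captures under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"

def parse_index_response(body: str) -> list[dict]:
    """The index returns one JSON object per line (JSON Lines)."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Fetching build_index_query("CC-MAIN-2020-05", "theglobeandmail.com") with
# any HTTP client would list that domain's captures in the chosen crawl.
```

Checking whether and how often a publisher’s URLs appear in a given crawl is, in other words, a matter of a few index queries rather than privileged knowledge held only by OpenAI.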
We also know about the data that went into GPT-2, which used WebText, a data set that traces back to 2019. The data was cut off in 2017, well within the range of the lawsuit. The top 1,000 URLs used in that data set are openly available here. The Canadian media companies do appear: the CBC ranks 21st, the Toronto Star is 73rd, the Globe is 78th, and the National Post is 124th. Other Postmedia papers such as the Vancouver Sun (425), Ottawa Citizen (433), Calgary Herald (725), and Montreal Gazette (799) also make the list. Yet the total Canadian contribution is relatively small. Content from the NY Times, Washington Post, and BBC was each individually used more than all of those Canadian media companies put together. In fact, blogging platforms such as Blogspot and WordPress were also each used more than all Canadian media companies combined. In other words, the totality of the Canadian materials included in the data set was less prominent than user-generated content on blogging platforms. And these rankings likely represent the high-water mark of the use of Canadian media, since as the data sets get even bigger, the percentage of Canadian media content undoubtedly shrinks. That doesn’t change the copyright analysis, but it does have licensing implications, since the relative importance of Canadian media content in AI systems is getting smaller as the field grows.
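The comparison above is simple arithmetic over the published domain list, which pairs each domain with a count. The sketch below shows the kind of tally involved; the numbers are made-up placeholders, not the real WebText figures.

```python
# Tallying a Canadian outlets' combined share against a single large outlet.
# The domain/count pairs here are hypothetical placeholders for illustration,
# not the actual WebText top-1,000 figures.
webtext_counts = {
    "nytimes.com": 22000,
    "bbc.co.uk": 17000,
    "cbc.ca": 5200,
    "thestar.com": 2100,
    "theglobeandmail.com": 2000,
    "nationalpost.com": 1300,
}

CANADIAN = {"cbc.ca", "thestar.com", "theglobeandmail.com", "nationalpost.com"}

# Sum every Canadian outlet's count and compare against one US outlet alone.
canadian_total = sum(n for d, n in webtext_counts.items() if d in CANADIAN)
```

With these placeholder numbers, `canadian_total` is smaller than the NY Times entry by itself, which is the pattern the real list shows.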
Doubts About the Copyright Claims
The copyright claims themselves will face a stiff challenge. Since there is no actual unauthorized publicly available publication of their works, the focus instead targets the scraping of the data for inclusion in training materials. As mentioned above, much of the data was not scraped by OpenAI. For the data it did scrape, the legal question will be whether that activity is permitted under Canada’s fair dealing rules. The data is used to create tokens for the statistical analysis that ultimately leads to ChatGPT’s outputs. The scraping will certainly meet the first fair dealing hurdle by qualifying as research. The second-stage six-factor analysis – purpose, character, amount, nature, effect, and alternatives to the dealing – will address the token creation and use, making for an interesting debate whose outcome is by no means certain. In fact, there will be arguments that the tokens derived from the underlying works involve statistical analysis rather than copying of those works.
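To make the "tokens, not copies" argument concrete, the toy sketch below performs a single merge step of byte-pair encoding (BPE), the family of algorithms GPT-style tokenizers are built on. This is a simplified illustration, not OpenAI’s actual tokenizer.

```python
# Toy one-step BPE merge: text becomes symbols, and the most frequent
# adjacent pair is fused into a new subword token. A simplified sketch,
# not OpenAI's production tokenizer.
from collections import Counter

def most_common_pair(symbols: list) -> tuple:
    """Find the most frequent adjacent symbol pair."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

symbols = list("the media the model")   # start from individual characters
pair = most_common_pair(symbols)        # e.g. ('t', 'h') in this sample
merged = merge_pair(symbols, pair)      # 'th' becomes a single token
```

Repeating the merge step builds a vocabulary of subword tokens; model training then operates on the statistics of those token sequences, which is the basis for the argument that what is retained is statistical in nature rather than a copy of the works.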
The Canadian media companies try to buttress their claim by citing other violations, notably including circumvention of technological protection measures and violation of the media companies’ terms of use on their websites. The circumvention claims are very weak. For example, the companies argue that failure to abide by a robots.txt file, which can be used to signal a request not to scrape data, violates the anti-circumvention rules. Yet a robots.txt file functions more like a stop sign than an effective technological protection measure. The companies also claim that OpenAI scrapes paywalled data, but there is no evidence offered to support the claim that paywalls were circumvented to access data. In fact, OpenAI says it does not bypass paywalls. Interestingly, the claim does not cite the removal of rights management information, which might have provided a stronger claim if attribution was deliberately removed or concealed. Finally, the companies also point to their terms of use, which restrict certain activities, but breaching those terms may only involve a breach of contract and not necessarily copyright infringement.
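The stop-sign point is easy to see in code: robots.txt is a plain-text request that well-behaved crawlers choose to honour, and nothing in it technically prevents access. The sketch below parses one with Python’s standard library; the crawler names and URL are illustrative.

```python
# robots.txt is advisory: it names crawlers and asks them to stay away,
# but enforcement depends entirely on the crawler's own compliance.
# Parsed here with Python's standard-library robotparser; the bot names
# and URL are illustrative.
import urllib.robotparser

robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# The named crawler is asked not to fetch anything...
blocked = parser.can_fetch("GPTBot", "https://example.com/article")
# ...while any crawler not listed faces no request at all.
allowed = parser.can_fetch("SomeOtherBot", "https://example.com/article")
```

A crawler that simply ignores the file encounters no technical barrier, which is why characterizing robots.txt as a technological protection measure is a stretch.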
Much like with Bill C-18, it would not be surprising if coverage of the case sided with the Canadian media companies. Indeed, there is certainly a need to consider the rights associated with the inclusion of content in large language models. But this case is no slam dunk, as it narrowly scopes the claims, fails to proffer much evidence at this stage, and faces some tough battles in interpreting Canadian copyright law. Thus far, many of the lawsuits elsewhere have largely failed: a U.S. judge recently dismissed a media case involving the removal of rights management information and a German court dismissed a claim over the inclusion of works in the LAION data set. Rather than dreaming of billions in damages, it seems more likely that the companies hope the case will provide the spark to reach a settlement that results in a new licensing deal.