How GitHub Copilot may lead Microsoft into a copyright crisis
Special report: GitHub Copilot, a programming auto-suggestion tool trained on publicly available source code from the internet, has been caught generating what appears to be copyrighted code, prompting an attorney to investigate a possible copyright-infringement claim.
On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is investigating the possibility of filing a copyright claim against GitHub with the Joseph Saveri Law Firm. There are two possible lines of attack here: whether GitHub is improperly training Copilot on open source code, and whether the tool improperly emits other people's copyrighted work - pulled from the training data - as code suggestions to users.
Butterick has been a vocal critic of Copilot since its inception. He wrote a blog post in June arguing that "any code generated by Copilot may contain lurking license or IP violations" and should thus be avoided.
That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) announced that their organization would discontinue use of GitHub, owing largely to Microsoft and GitHub's release of Copilot without addressing concerns about how the machine-learning model dealt with various open source licensing requirements.
Copilot's ability to copy code verbatim, or nearly so, was demonstrated last week when Tim Davis, a computer science and engineering professor at Texas A&M University, found that, when prompted, Copilot would reproduce his copyrighted sparse matrix transposition code.
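For context, sparse matrix transposition is a standard routine for flipping the rows and columns of a matrix stored in a compressed format. The generic Python sketch below is written purely as an illustration of what such a routine does for a matrix held in compressed sparse row (CSR) form; it is not Davis's copyrighted implementation, nor the code Copilot emitted.

# Minimal, generic sketch of sparse matrix transposition in CSR form.
# Illustration only - NOT Tim Davis's copyrighted code.
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Return the transpose of a CSR matrix as new CSR arrays."""
    # Count entries destined for each row of the transpose
    # (i.e. each column of the original matrix).
    counts = [0] * n_cols
    for j in indices:
        counts[j] += 1

    # Build the row-pointer array of the transpose by prefix sum.
    t_indptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        t_indptr[j + 1] = t_indptr[j] + counts[j]

    # Scatter entries into place, tracking the next free slot per row.
    next_slot = list(t_indptr[:-1])
    t_indices = [0] * len(indices)
    t_data = [0] * len(data)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            t_indices[dest] = i  # the column index in the transpose is the original row
            t_data[dest] = data[k]
            next_slot[j] += 1

    return t_indptr, t_indices, t_data

The technique itself is textbook material; the dispute is over Copilot reproducing Davis's particular, copyrighted expression of it nearly verbatim.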
When asked to comment, Davis said he'd rather wait until he hears back from GitHub and its parent company, Microsoft, about his concerns.
Butterick told The Register in an email that the news of his investigation has gotten a lot of attention.
He wrote, "Clearly, many developers have been concerned about what Copilot means for open source." "There are a lot of stories being told. Our experience with Copilot has been similar to what others have discovered, in that it is relatively easy to get Copilot to emit verbatim code from identifiable open source repositories. We expect to see more examples as our investigation progresses.
"However, keep in mind that verbatim copying is only one of many issues raised by Copilot." A software author's copyright in their code, for example, can be violated without verbatim copying. Furthermore, the majority of open-source code is governed by a license, which imposes additional legal requirements. Has Copilot met these criteria? All of these issues are being investigated."
Microsoft and GitHub representatives declined to comment for this article. However, the documentation for Copilot on GitHub warns that its output may contain "unwanted patterns" and places the onus of intellectual property infringement on the user. That is, if you use Copilot to auto-complete code for you and are sued, you have been warned. That warning suggests the possibility of Copilot producing copyrighted code was anticipated.
'Eager'
When GitHub released a beta version of Copilot in 2021, concerns about copyright and licensing arose. At the time, then-CEO Nat Friedman stated that "training ML systems on public data is fair use [and] the output belongs to the operator, just like with a compiler. We anticipate that intellectual property and artificial intelligence will be topics of interest in policy discussions around the world in the coming years, and we are eager to participate!"
In addition, GitHub has funded panel discussions about the impact of AI on open source at an event hosted by the Open Source Initiative, which is itself partly funded by Microsoft.
In an email, Kuhn from the SFC told The Register that statements by GitHub's now-former CEO that these copyright issues are settled law create a false narrative - a point he's made before.
"We've spoken with Microsoft and GitHub on this issue multiple times, and their unsupported anti-FOSS [free and open source software] stance has remained disturbingly consistent," he wrote. "We believe that Microsoft and GitHub have made the political calculation that if they keep repeating, early and often, that what they're doing is acceptable, they can make true what is not known to be true."
However, there is hope among those who find tools like Copilot useful that assistive AI can be reconciled with our social and legal frameworks - that the output of such a model will not land its users in litigation.
"AI-assisted programming tools are not going away and will continue to evolve," said Brett Becker, assistant professor at University College Dublin in Ireland, in an email to The Register. Where these tools fit in the current landscape of programming practices, law, and community norms is still being explored and will continue to evolve.
"A fascinating question is: what will emerge as the primary drivers of this evolution?" Will these tools fundamentally alter future practices, laws, and community norms, or will our practices, laws, and community norms prove resilient, driving the evolution of these tools?"
The legal implications of large language models - such as OpenAI's Codex, on which Copilot is based - and of text-to-image models trained on datasets compiled by the German non-profit LAION, such as Imagen and Stable Diffusion, continue to be hotly debated. Similar concerns have been raised about the images generated by Midjourney.
When asked if he believes large language models (LLMs) focused on generating source code are more prone to copyright violations due to the constrained nature of their output, Butterick declined to generalize.
"We've also looked into image generators; users have discovered that DALL-E, Midjourney, and Stable Diffusion all have different strengths and weaknesses." "The same will most likely be true for LLMs in coding," he predicted.
"These concerns have been raised about Copilot since it was first made available in beta. There will almost certainly be some legal issues that are common to all of these systems, particularly in the handling of training data. Again, we are not the first to raise these concerns. One significant distinction between open-source code and images is that images are typically distributed under more restrictive licenses than open-source licenses."
There are also unresolved social and ethical issues, such as whether AI-generated code should be considered plagiarism and whether creators of the materials used to train a neural network should have a say in how that AI model is used.
Mark Lemley, a Stanford law professor, and Bryan Casey, a Stanford law lecturer at the time, posed the question "Will copyright law allow robots to learn?" in the Texas Law Review in March 2021. They argue that it should, at least in the United States.
"[Machine learning] systems should generally be able to use databases for training, whether or not the contents of that database are copyrighted," they wrote, adding that copyright law isn't the best tool for policing infringement.
However, when it comes to the output of these models – the code suggestions generated automatically by tools like Copilot – the potential for Butterick's proposed copyright claim appears to be stronger.
"I actually think there's a decent chance there's a good copyright claim," Tyler Ochoa, a law professor at Santa Clara University in California, told The Register over the phone.
According to Ochoa, there may be software license violations in the ingestion of publicly accessible code, but this is likely protected by fair use. While there hasn't been much litigation on the subject, a number of scholars have taken that stance, and he said he's inclined to agree.
Kuhn is less willing to overlook how Copilot handles software licenses.
"What Microsoft's GitHub has done in this process is completely unethical," he said. "They have declared that they know better than the courts and our laws about what is or is not permissible under a FOSS license without discussion, consent, or engagement with the FOSS community." They have completely ignored all FOSS license attribution clauses, and, more importantly, the more freedom-protecting requirements of copyleft licenses."
However, in terms of where Copilot may be vulnerable to a copyright claim, Ochoa believes that LLMs that output source code are more likely to echo training data than models that generate images. This could be a problem for GitHub.
"When you're trying to output code, source code, I think you have a very high likelihood that the code you output will look like one or more of the inputs," he said. "Once something works well, a lot of other people will copy it."
According to Ochoa, the output is likely to be the same as the training data for one of two reasons: "One is that there is only one good way to do it. The other is that you're basically copying an open source solution.
"OK, if there's only one good way to do it, that's probably not eligible for copyright. However, there is likely to be a lot of code in [the training data] that has used the same open source solution, and the output will look very similar to that. And that's just plagiarism."
In other words, the model may suggest code to solve a problem that has only one practical solution, or it may be a copy of someone else's open source that does the same thing. In either case, this is most likely due to a large number of people using the same code, which appears frequently in the training data, causing the assistant to regurgitate it.
Is that permissible? It's unclear. According to Ochoa, because the code is functional, reproducing it in a suggestion may not be considered particularly transformative, which is one of the criteria for determining fair use. Then there's the question of whether copying hurts the market when the market doesn't charge for open source code. Fair use may not apply if it harms the market.
"The problem here is that the market does not charge you money for these uses," said Ochoa, adding that the market is most interested in the terms of the open source licenses. "If a court believes those conditions are important, they will say, 'yeah, you're harming the market for these works by not complying with the conditions.'" [The software developers] are not receiving the attention that they desired when they created these words in the first place.
"As a result, they are not seeking monetary compensation." They want non-monetary compensation. And they don't get it. And if they don't get it, they'll be less likely to contribute open source code in the future. In theory, this hurts the market for these works or reduces the incentive to create them."
As a result, the generated code may not be transformative enough to be fair use, and it may harm the market as described, potentially jeopardizing a fair use claim.
When Berkeley Artificial Intelligence Research considered this issue in 2020, the group suggested that, given concerns about privacy, bias, and the law, training large language models from public web data may be fundamentally flawed. They proposed that instead of scouring the web, tech companies invest in better training data collection. That does not appear to have occurred.
Kuhn contends that the status quo must be changed, and that the SFC has been discussing Microsoft's GitHub with its litigation counsel for over a year.
"We are at a cultural crossroads that science fiction predicted in many ways," he said.
"Big Tech companies are attempting to impose their preferred conclusions about artificial intelligence applications on us in a variety of ways, regardless of what the law says or what values the community of users, consumers, and developers holds." FOSS, and Microsoft's GitHub's inappropriate exploitation of FOSS, is just one method among many. We have to confront Big Tech's behavior here, and we intend to do so."
When asked what the ideal outcome would be, Butterick said it's too early to tell.
"We don't know a lot about how Copilot works," he wrote.
"Certainly, we can imagine versions of Copilot that are more respectful of open-source developers' rights and interests." As things stand, it poses an existential threat to open source.
"It's obvious that GitHub, a company that built its reputation and market value on its deep ties to the open source community, would release a product that monetizes open source in a way that harms the community." On the other hand, given Microsoft's long history of hostility toward open source, perhaps it's not so surprising. When Microsoft purchased GitHub in 2018, many open source developers, including myself, hoped for the best. That hope, it appears, was misplaced."