Large language models feed on big data from publicly available training sets, but many of those sets are of doubtful legal status.
The scope of the problem has been demonstrated by the newly launched Data Provenance Initiative, which brings together a multi-institutional team of machine-learning and legal experts led by researchers at the Massachusetts Institute of Technology and Cohere for AI, a nonprofit research lab created by the AI company Cohere.
The group audited more than 1,800 widely used datasets for training natural language processing models that are publicly available on websites like GitHub, Hugging Face, and Papers With Code. They found that more than 70 percent specified no data license, and for those that did, roughly half of the licenses were incorrect. What's more, 29 percent of the incorrect licenses were more permissive than the dataset creators had intended.
GitHub, Hugging Face, and Papers With Code did not respond to a request for comment.
“Practitioners who want to do the right thing can’t because the data provenance of even these most widely used datasets is uncertain,” says Sara Hooker, head of Cohere for AI and coauthor of a research paper describing the audit, which has yet to be peer reviewed. “This is, for me, the part that is most unsettling about what we discovered, which is that practitioners are being given inconsistent and sometimes completely unreliable information on widely used providers that position themselves as the home of datasets.”
There have been growing concerns about the sourcing of AI training data, which is often indiscriminately scraped from the Web. Tech companies are facing legal cases brought by both artists and authors, who claim that using their work for model training infringes their copyright. The Atlantic magazine has created a searchable database that lets authors see whether their work has been included in AI datasets. A class action lawsuit has also been filed against Google for using social-media posts that have been scraped without permission to train its large language models.
The Data Provenance Initiative focused on datasets used for instruction fine-tuning, the process of further training large models so they better follow natural-language instructions in particular domains. The team analyzed 1,858 of the most popular open-source datasets using both manual and automated approaches, including GPT-4 to help analyze the underlying data.
As well as gathering licensing information, the audit compiled information on the original source of the text, the dataset’s creators, and other characteristics such as language, task categories, and text topics. The team also created a free online tool called the Data Provenance Explorer, which allows developers to trace the provenance of datasets and filter for different licensing conditions.
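The kind of filtering such a tool supports can be sketched in a few lines. This is a minimal, hypothetical illustration, not the Data Provenance Explorer's actual interface or schema; the metadata fields and license names below are invented for the example.

```python
# Hypothetical sketch: filter per-dataset provenance records so that only
# datasets with a documented source and a commercial-use-friendly license
# remain. Field names and license tags are illustrative assumptions.

records = [
    {"name": "instruction-pairs", "license": "cc-by", "source": "documented"},
    {"name": "web-qa-dump", "license": "cc-by-nc", "source": "documented"},
    {"name": "mystery-mix", "license": None, "source": "unknown"},
]

# Licenses treated here as permitting commercial use (simplified).
COMMERCIAL_OK = {"cc-by", "mit", "apache-2.0"}

usable = [
    r for r in records
    if r["license"] in COMMERCIAL_OK and r["source"] == "documented"
]

print([r["name"] for r in usable])  # -> ['instruction-pairs']
```

The point of the sketch is that a developer can only run a query like this if the licensing and source metadata exist in the first place, which is exactly what the audit found to be missing or wrong so often.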
Licensing information was unspecified for 72 percent of the sets analyzed on GitHub and Hugging Face and for 83 percent on Papers With Code. But during the audit, the researchers found that many of these datasets did in fact have licenses, and by the end of the process only 31 percent of the 1,800 analyzed had no obvious licensing information. Even when licenses were mentioned on data-hosting sites, the team found they were often more permissive than the original creator intended: 29 percent of the time on GitHub, 27 percent on Hugging Face, and 16 percent on Papers With Code.
Attaching licenses is the responsibility of the person who uploads the dataset. If it isn’t done properly, whoever trains a model on the dataset could unwittingly breach license restrictions, such as those limiting commercial use or requiring credit for the creators of the dataset.
A big part of the problem, says Hooker, is that many publicly available collections are actually compilations of lots of smaller datasets. Often those collating the data attach a single license to the final dataset, even when its component datasets have different requirements. These collections are sometimes compiled into even bigger collections, she says, compounding the problem.
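The compounding effect Hooker describes follows from a simple rule: a merged collection can only legitimately carry terms at least as restrictive as its most restrictive component. The sketch below illustrates that rule; the license tiers and dataset names are invented for the example and are not the initiative's actual taxonomy.

```python
# Minimal sketch of license composition: when datasets are bundled, the
# bundle should inherit the MOST restrictive component's terms, not
# whatever single license a repackager attaches. Tiers are a simplified
# assumption, ordered from most to least permissive.

from dataclasses import dataclass

PERMISSIVENESS = {
    "commercial-ok": 0,        # e.g. MIT- or CC-BY-style terms
    "attribution-required": 1,
    "non-commercial": 2,       # e.g. CC-BY-NC-style terms
    "unspecified": 3,          # no license at all: safest to treat as most restrictive
}

@dataclass
class Dataset:
    name: str
    license: str

def effective_license(components: list[Dataset]) -> str:
    """Return the most restrictive license among the components."""
    return max(components, key=lambda d: PERMISSIVENESS[d.license]).license

collection = [
    Dataset("dialogue-pairs", "commercial-ok"),
    Dataset("qa-scrape", "non-commercial"),
    Dataset("forum-dump", "unspecified"),
]

print(effective_license(collection))  # -> "unspecified"
```

Each round of repackaging that attaches one blanket license, instead of propagating the most restrictive component terms this way, quietly launders restrictions out of the metadata.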
“It’s just been packaged and repackaged so many times that it’s really hard for a practitioner to discern, ‘Am I using this responsibly?’ ” says Hooker. “When you start to scrape below the surface, what you’re finding is that it’s got conflicting licenses, and some datasets probably shouldn’t have been included at all.”
The scale of the problem suggests there is a pervasive culture within the AI community that doesn’t take data provenance and licensing seriously, says Hooker. There is considerable ambiguity as to how copyright law applies to data used for training AI, and different countries are likely to reach different conclusions. It’s also difficult to tell when data has been used to train a model, says Hooker, so there’s little scope for tracing when licenses have been breached.
The initiative provides useful results, says Paolo Missier, a professor of big-data analytics at Newcastle University, in England. But he quibbles with the researchers’ use of the term “data provenance,” when the main focus of the audit is on licensing.
Data-provenance research normally seeks to reconstruct the entire history of a dataset and all the changes and transformations it has undergone. The goal is to provide users with the information required to assess how reliable the data is, whether it's subject to any bias, and whether it's appropriate to use in a particular context. "A license is important metadata. It's not provenance," says Missier.
Karl Werder, an assistant professor at the University of Cologne, in Germany, agrees that the initiative has taken a somewhat narrow approach to the topic of provenance. But he thinks the issue of licensing is an important one and the lookup tool the team has built could be very useful for AI developers. “It’s great work,” he says. “I can only commend the authors for pulling this off and making this publicly available so other practitioners can benefit from it.”
At present, trying to establish the provenance of data is a major headache for developers, says Werder, and often involves lengthy digging. There has been promising work on standardized datasheets that provide answers to the key questions developers may have, but their adoption has been piecemeal so far and data-hosting sites have little incentive to require them.
“From a platform perspective, it’s a bit of a trade-off, because you’re increasing the hurdle to upload the dataset,” Werder says. “If you make it mandatory, that will probably decrease the use.”
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com. His PGP fingerprint is ABB8 6BB3 3E69 C4A7 EC91 611B 5C12 193D 5DFC C01B. His public key is here. DM for Signal info.