Public AI Training Datasets Are Rife With Licensing Errors

IEEE Spectrum: For the Technology Insider
© Copyright 2024 IEEE — All rights reserved. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

Large language models feed on big data from publicly available training sets, but most of the sets are of doubtful legal status.

The scope of the problem has been demonstrated by the newly launched Data Provenance Initiative, which brings together a multi-institutional team of machine-learning and legal experts led by researchers at the Massachusetts Institute of Technology and Cohere for AI, a nonprofit research lab created by the AI company Cohere.

The group audited more than 1,800 widely used, publicly available datasets for training natural-language-processing models, hosted on websites like GitHub, Hugging Face, and Papers With Code. They found that more than 70 percent carried no licensing information, and that roughly half of the licenses that were listed were incorrect. What’s more, 29 percent of the incorrect licenses were more permissive than the dataset creators had intended.

GitHub, Hugging Face, and Papers With Code did not respond to a request for comment.

“Practitioners who want to do the right thing can’t because the data provenance of even these most widely used datasets is uncertain,” says Sara Hooker, head of Cohere for AI and coauthor of a research paper describing the audit, which has yet to be peer reviewed. “This is, for me, the part that is most unsettling about what we discovered, which is that practitioners are being given inconsistent and sometimes completely unreliable information on widely used providers that position themselves as the home of datasets.”

There have been growing concerns about the sourcing of AI training data, which is often indiscriminately scraped from the Web. Tech companies are facing legal cases brought by both artists and authors, who claim that using their work for model training infringes their copyright. The Atlantic magazine has created a searchable database that lets authors see whether their work has been included in AI datasets. A class action lawsuit has also been filed against Google for using social-media posts that have been scraped without permission to train its large language models.

The Data Provenance Initiative focused on datasets used for instruction fine-tuning, the process of retraining large models to better follow natural-language instructions in particular domains. The team analyzed 1,858 of the most popular open-source datasets using both manual and automated approaches, including using GPT-4 to help analyze the underlying data.

As well as gathering licensing information, the audit compiled information on the original source of the text, the dataset’s creators, and other characteristics such as language, task categories, and text topics. The team also created a free online tool called the Data Provenance Explorer, which allows developers to trace the provenance of datasets and filter for different licensing conditions.
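To make the idea of filtering by licensing conditions concrete, here is a minimal Python sketch. It is not the Data Provenance Explorer's code or data model: the `DatasetRecord` fields, the `filter_for_commercial_use` helper, and the three toy datasets are all hypothetical, chosen only to show what a license filter over dataset metadata might look like.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; not the initiative's actual schema."""
    name: str
    license_id: Optional[str]  # None means no license information was found
    commercial_use: bool       # does the license permit commercial use?
    attribution_required: bool # must the dataset creators be credited?


def filter_for_commercial_use(records: List[DatasetRecord]) -> List[DatasetRecord]:
    """Keep only datasets with a known license that permits commercial use."""
    return [r for r in records if r.license_id is not None and r.commercial_use]


# Toy, made-up datasets used purely for illustration.
records = [
    DatasetRecord("toy-qa", "CC-BY-4.0", True, True),
    DatasetRecord("toy-chat", "CC-BY-NC-4.0", False, True),
    DatasetRecord("toy-summaries", None, False, False),
]

usable = filter_for_commercial_use(records)
print([r.name for r in usable])
```

In this sketch a dataset with no license information is excluded rather than assumed to be usable, which mirrors the cautious reading of the audit: missing metadata is not the same as permission.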

Licensing information was unspecified for 72 percent of the sets analyzed on GitHub and Hugging Face and for 83 percent on Papers With Code. But during the audit, the researchers found that many of these datasets did in fact have licenses, and by the end of the process only 31 percent of the more than 1,800 datasets analyzed had no obvious licensing information. Even when licenses were mentioned on data-hosting sites, the team found that they were often more permissive than the original creator intended: 29 percent of the time on GitHub, 27 percent on Hugging Face, and 16 percent on Papers With Code.

Attaching licenses is the responsibility of the person who uploads the dataset. If it isn’t done properly, whoever trains a model on the dataset could unwittingly breach license restrictions, such as those limiting commercial use or requiring credit for the creators of the dataset.

A big part of the problem, says Hooker, is that many publicly available collections are actually compilations of lots of smaller datasets. Often those collating the data attach a single license to the final dataset, even when its component datasets have different requirements. These collections are sometimes compiled into even bigger collections, she says, compounding the problem.

“It’s just been packaged and repackaged so many times that it’s really hard for a practitioner to discern, ‘Am I using this responsibly?’ ” says Hooker. “When you start to scrape below the surface, what you’re finding is that it’s got conflicting licenses, and some datasets probably shouldn’t have been included at all.”
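The repackaging problem Hooker describes can be illustrated with a short Python sketch. It assumes a deliberately simplified model in which a license is reduced to two flags, commercial use and attribution, matching the two kinds of restriction mentioned earlier. Real licenses carry many more conditions, and the `Terms`, `combine`, and `is_more_permissive` names are hypothetical rather than taken from the audit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Terms:
    """Simplified license terms (an assumption; real licenses have more conditions)."""
    commercial_use: bool        # commercial use is permitted
    attribution_required: bool  # creators must be credited


def combine(components: Iterable[Terms]) -> Terms:
    """Terms a compilation would need in order to respect every component.

    Commercial use is allowed only if every component allows it;
    attribution is required if any component requires it.
    """
    components = list(components)
    return Terms(
        commercial_use=all(t.commercial_use for t in components),
        attribution_required=any(t.attribution_required for t in components),
    )


def is_more_permissive(declared: Terms, effective: Terms) -> bool:
    """True if the declared license grants more than the components allow."""
    grants_commercial = declared.commercial_use and not effective.commercial_use
    drops_attribution = effective.attribution_required and not declared.attribution_required
    return grants_commercial or drops_attribution


# Three made-up component datasets with differing requirements.
parts = [Terms(True, True), Terms(False, True), Terms(True, False)]
effective = combine(parts)

# A single license attached to the whole collection by whoever compiled it.
declared = Terms(commercial_use=True, attribution_required=False)

print(effective)
print(is_more_permissive(declared, effective))
```

Under these assumptions, one noncommercial component is enough to make the whole collection noncommercial, so a blanket permissive license on the compilation overstates what users are allowed to do.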

The scale of the problem suggests there is a pervasive culture within the AI community that doesn’t take data provenance and licensing seriously, says Hooker. There is considerable ambiguity as to how copyright law applies to data used for training AI, and different countries are likely to reach different conclusions. It’s also difficult to tell when data has been used to train a model, says Hooker, so there’s little scope for tracing when licenses have been breached.

The initiative provides useful results, says Paolo Missier, a professor of big-data analytics at Newcastle University, in England. But he quibbles with the researchers’ use of the term “data provenance,” when the main focus of the audit is on licensing.

Data-provenance research normally seeks to reconstruct the entire history of a dataset and all the changes and transformations it has undergone. The goal is to provide users with the information required to assess how reliable the data is, whether it’s subject to any bias and whether it’s appropriate to use in a particular context. “A license is important metadata. It’s not provenance,” says Missier.

Karl Werder, an assistant professor at the University of Cologne, in Germany, agrees that the initiative has taken a somewhat narrow approach to the topic of provenance. But he thinks the issue of licensing is an important one and the lookup tool the team has built could be very useful for AI developers. “It’s great work,” he says. “I can only commend the authors for pulling this off and making this publicly available so other practitioners can benefit from it.”

At present, trying to establish the provenance of data is a major headache for developers, says Werder, and often involves lengthy digging. There has been promising work on standardized datasheets that provide answers to the key questions developers may have, but their adoption has been piecemeal so far and data-hosting sites have little incentive to require them.

“From a platform perspective, it’s a bit of a trade-off, because you’re increasing the hurdle to upload the dataset,” Werder says. “If you make it mandatory, that will probably decrease the use.”
