How to Protect Privacy While Mining Millions of Patient Records

A new COVID-19 data-science platform backed by the U.K. National Health Service aims to provide secure access to patient data during the coronavirus pandemic

When people in the United Kingdom began dying from COVID-19, researchers saw an urgent need to understand all the possible factors contributing to such deaths. So in six weeks, a team of software developers, clinicians, and academics created an open-source platform designed to securely analyze millions of electronic health records while protecting patient privacy.

The new OpenSAFELY analytics platform enabled the largest study yet of hospital deaths related to COVID-19 among more than 17 million adult patients in the U.K.’s National Health Service (NHS). It also shows how large-scale computational power can speedily access and analyze patient information during a public-health emergency, without removing sensitive information from data centers belonging to software companies that maintain electronic health records.

With the new platform, only a small band of trusted analysts ever gains direct access to patient data, and all database queries are logged for security reasons. “The key barrier to accessing that scale of data, historically, was principally that we weren’t handling the security and privacy issues in a way that was compatible with meeting patients’ entirely reasonable requirements for privacy,” says Ben Goldacre, a physician and director of DataLab at the University of Oxford.

As joint principal investigator for the OpenSAFELY project, Goldacre has for years wanted to make research on electronic health records (EHRs) more efficient. The new study, presented in a preprint paper posted on medRxiv on 6 May 2020, represents the first fruits of his team’s labor, but such findings won’t be the last. Because OpenSAFELY can securely work with any large NHS data set, the team is already running new analyses to identify which treatments and drugs increase or decrease the risks for COVID-19 patients and to better understand and predict how the disease spreads.

Traditionally, researchers have had to pay large amounts of money to extract relatively limited samples of electronic health records from the original database. This extraction approach is also problematic from a security and privacy perspective because many studies have shown that it’s still possible to reidentify patients from pseudonymized records that remove the person’s name, birth date, and detailed aspects of the home address.

“The problem is calling health records pseudonymized or anonymized is never guaranteed privacy,” says Erman Ayday, a security and privacy researcher at Case Western Reserve University in Cleveland, who was not involved in OpenSAFELY. “If you have some background information about the individuals, you can easily reidentify some of them in the anonymized records.”
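
The reidentification risk Ayday describes can be illustrated with a toy linkage example. This is only a sketch under stated assumptions: the field names (`age_band`, `region`, `admission_month`), the records, and the people are all invented for illustration and do not come from any real data set or from OpenSAFELY.

```python
# Hypothetical quasi-identifiers that survive pseudonymization in this toy example.
QUASI_IDENTIFIERS = ("age_band", "region", "admission_month")


def reidentify(pseudonymized, background):
    """Link known people to pseudonymized records via quasi-identifiers.

    A person is considered reidentified only when exactly one record
    shares all of their quasi-identifier values.
    """
    matches = {}
    for person in background:
        candidates = [
            record
            for record in pseudonymized
            if all(record[k] == person[k] for k in QUASI_IDENTIFIERS)
        ]
        if len(candidates) == 1:  # a unique match pins down the record
            matches[person["name"]] = candidates[0]["pseudo_id"]
    return matches


# Made-up pseudonymized records: no names, birth dates, or addresses.
records = [
    {"pseudo_id": "a91f", "age_band": "70-79", "region": "North",
     "admission_month": "2020-03", "diagnosis": "asthma"},
    {"pseudo_id": "c27b", "age_band": "70-79", "region": "North",
     "admission_month": "2020-04", "diagnosis": "diabetes"},
    {"pseudo_id": "e55d", "age_band": "40-49", "region": "South",
     "admission_month": "2020-03", "diagnosis": "hypertension"},
    {"pseudo_id": "f60a", "age_band": "40-49", "region": "South",
     "admission_month": "2020-03", "diagnosis": "asthma"},
]

# Made-up background knowledge an attacker might hold about two people.
known_people = [
    {"name": "Person A", "age_band": "70-79", "region": "North",
     "admission_month": "2020-03"},
    {"name": "Person B", "age_band": "40-49", "region": "South",
     "admission_month": "2020-03"},
]

linked = reidentify(records, known_people)
```

In this toy data, Person A matches exactly one record and is linked to it, while Person B matches two records and stays ambiguous. The more background detail an attacker holds, the more often the match is unique.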

Such heavy reliance on pseudonymization and older security practices seemed insufficient when the pandemic hit home for the United Kingdom and researchers needed access to big data in a timely manner in order to better understand the novel coronavirus. That’s why Goldacre and his team—spread across the University of Oxford, the EHR group at the London School of Hygiene and Tropical Medicine, and electronic health-record software companies such as TPP—volunteered their services to the U.K. National Health Service to figure out a secure solution for data mining the health information of millions of people.

“We knew it wouldn’t be acceptable, or it certainly wouldn’t be acceptable to us, with our high standards around security, to be anything like an extraction service where you send the data off and nobody knows what happens to it next,” Goldacre says. In such services, he adds, “there are no logs kept and you trust people—you evaluate that they’re trustworthy—but fundamentally it relies on trust, not proof.”

Although they didn’t realize it before the pandemic, Goldacre and his colleagues have effectively been preparing for this project for the last four years. Many had previously worked together on projects such as OpenPrescribing, an online tool that allows patients and clinicians alike to track changes in drug prescription patterns across the NHS. Along the way, they’ve produced both academic papers and data-science tools and services.

From the start, the team knew they needed a more secure way to handle patient information than extracting data from electronic health records databases. So they built the OpenSAFELY analytics platform to run inside the security-certified data centers where the electronic health records already reside.

“We bring the analytics to the place where the data is already being kept securely for routine care,” Goldacre says. “And then in addition to that, we add on several extra layers of security.”

The records contain all the pseudonymized primary care data for each patient, such as when a pseudonymous person in a pseudonymous location was given a prescription for a particular drug at a particular dose for seven days. Access to this event-level data is restricted to a very small group of trusted data analysts, who use standard SQL queries to pull the data relevant to a specific study. Every single query against the database is logged, Goldacre says, so that no one can get away with unethical practices such as trying to stalk ex-partners using patient records.

The next tier of access consists of a patient-level database, where each pseudonymous patient’s entry holds a small amount of information about features relevant to a specific study. This is where researchers can run statistical analyses to better understand factors that may have contributed to the survival of COVID-19 patients. And the only patient information that ever leaves the data center consists of summary tables from those analyses.
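
The three tiers described above (logged queries on event-level data, a patient-level table for one study, and summary tables as the only output) can be sketched in a few lines. This is a simplified illustration, not OpenSAFELY's actual code: the table layout, the `logged_query` helper, the analyst name, and every row of data are assumptions made up for the example.

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in for an audit trail; a real system would use protected, append-only storage.
audit_log = []


def make_event_db():
    """Build a toy in-memory event-level table (all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (patient_id TEXT, event TEXT, code TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("p1", "diagnosis", "diabetes"),
            ("p1", "outcome", "covid_death"),
            ("p2", "diagnosis", "diabetes"),
            ("p3", "prescription", "statin"),
            ("p4", "outcome", "covid_death"),
        ],
    )
    return conn


def logged_query(conn, analyst, sql, params=()):
    """Tier 1: run SQL on event-level data and record who ran what, and when."""
    timestamp = datetime.now(timezone.utc).isoformat()
    audit_log.append((timestamp, analyst, sql, tuple(params)))
    return conn.execute(sql, params).fetchall()


def build_patient_level(conn, analyst):
    """Tier 2: one row per pseudonymous patient with only study-relevant features."""
    rows = logged_query(
        conn,
        analyst,
        """
        SELECT patient_id,
               MAX(event = 'diagnosis' AND code = 'diabetes') AS has_diabetes,
               MAX(event = 'outcome' AND code = 'covid_death') AS died
        FROM events
        GROUP BY patient_id
        ORDER BY patient_id
        """,
    )
    return [
        {"patient_id": pid, "has_diabetes": bool(diab), "died": bool(died)}
        for pid, diab, died in rows
    ]


def summary_table(patients):
    """Tier 3: aggregate counts only, the sole output meant to leave the data center."""
    table = {}
    for row in patients:
        key = (row["has_diabetes"], row["died"])
        table[key] = table.get(key, 0) + 1
    return table


conn = make_event_db()
patients = build_patient_level(conn, analyst="analyst_a")
counts = summary_table(patients)
```

Here `counts` maps each (has_diabetes, died) combination to a number of patients, and `audit_log` holds one entry for the single query that touched the event-level table.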

OpenSAFELY’s logging of all the statistical requests represents one good security approach, Ayday says. To further reduce risk, he suggested that such systems could use differential privacy techniques that add random noise to summary statistics in order to make it more difficult to reidentify patients.
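
One common way to add such noise is the Laplace mechanism. The sketch below shows it applied to a single count. It illustrates Ayday's suggestion only and is not something the article says OpenSAFELY implements; the function names and the `epsilon` value in the usage line are assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=None):
    """Return a count with Laplace noise calibrated to a privacy budget epsilon.

    Adding or removing one patient changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical usage: release a noisy version of a made-up count of 120.
released = noisy_count(120, epsilon=0.5)
```

A smaller `epsilon` means more noise and stronger privacy, at the cost of less accurate summary statistics.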

“The good thing about OpenSAFELY is the people who have access to these data sets are very few and from the beginning they’re planning to provide very controlled access by keeping these logs and everything, so that’s a good thing,” Ayday says.

The security approach of performing all data analysis within preexisting data centers also sidesteps the logistical issue of trying to extract and transmit huge amounts of data via a secure online network connection. The latter could prove especially problematic when trying to transmit data about potentially millions of people.

“This is the scale of data where getting it down a huge pipe from one place to another would itself be so painfully slow that, even during [the] COVID-19 lockdown, you’d be better off getting in your car with some hard drives on the passenger seat [to extract the data],” Goldacre says.

The team is updating the OpenSAFELY platform so that it won’t even require anyone to run SQL queries on the event-level database. Instead, the updated platform will take anyone’s analytical query—written in standard statistical software such as R or Stata—and automatically pull the relevant data from the event-level database to generate the patient-level data for analysis. After performing the analysis, the platform would give the results back to the requester.

As an added level of transparency and accountability, all future users would have to openly deploy their code and statistical analyses from GitHub. That means people could see if anyone attempted to run a problematic statistical analysis on the records.

The OpenSAFELY team’s GitHub repository currently holds more than 45,000 lines of open-source code available for anyone to review, modify, and reuse. Because it was designed to be portable software, it could run against primary care data in many other databases. “We want it to be stored in the most secure place, which to our minds right now is the data center of the [electronic health records] vendor,” Goldacre says. “But if the NHS ever produced its own data store, we could point it to work there.”

Much of OpenSAFELY’s security framework is not necessarily new from Ayday’s standpoint. And he warned that any serious data breach of the data center holding the electronic health records—something that isn’t uncommon in the world of health care—could still compromise patient privacy.

But despite all the security challenges, he views such efforts as worthwhile as long as researchers take steps to minimize the privacy risks. “What these guys are doing is beneficial, definitely, because they are trying to identify the statistics related to this virus and hopefully it’s going to give some valuable results,” Ayday says.

This project may be the latest example of how the pandemic forces health care systems to finally make changes long overdue. Goldacre says he welcomes queries from anyone outside the United Kingdom who might want to adapt the platform for their own uses in analyzing health data.

“We had to take the opportunity to build—well, I sound like a character from ‘Silicon Valley’ when I say this—to build a better future for computational data science in health care,” Goldacre says.
