During the COVID-19 pandemic, digital contact tracing apps based on the Bluetooth technology found in smartphones have been deployed by various countries despite the fact that Bluetooth’s baseline performance as a proximity detector remains mostly a mystery.
That is why the U.S. National Institute of Standards and Technology organized a months-long event that leveraged the talents of AI researchers around the world to help evaluate and potentially improve upon that baseline Bluetooth performance for helping detect when smartphone users are standing too close to one another.
The appropriately named Too Close for Too Long (TC4TL) Challenge has yielded a mixed bag for anyone looking to be optimistic about the performance of Bluetooth-based contact tracing and exposure notification apps. The challenge attracted a diverse array of AI research teams from around the world who showed how machine learning might help boost proximity detection by analyzing the patterns in Bluetooth signals and data from other phone sensors. But the teams’ testing results presented during a final evaluation workshop held on August 28 also showed that Bluetooth’s capability alone in detecting nearby phones is shaky at best.
“We showed that if you hold both phones in hand, you're going to get relatively better proximity detection results on this pilot dataset,” says Omid Sadjadi, a computer scientist at the National Institute of Standards and Technology. “When it comes to everyday scenarios where you put your phones in your pocket or your purse or in other carrier states and in other locations, that's when the performance of this proximity detection technology seems to start to degrade.”
Bluetooth Low Energy (BLE) technology was not originally designed to use Bluetooth signals from phones to accurately estimate the distance between phones. But the technology has been thrown into the breach to help hold the line for contact tracing during the pandemic. The main reason why many countries have gravitated toward Bluetooth-based apps is that they generally represent a more privacy-preserving option compared to using location-based technologies such as GPS.
Given the highly improvised nature of this Bluetooth-based solution, it made sense for the U.S. National Institute of Standards and Technology (NIST) to assist in evaluating the technology’s performance. NIST has previously helped establish testing benchmarks and international standards for evaluating widely used modern technologies such as online search engines and facial recognition. And so the agency was more than willing to step up again when asked by researchers working on digital contact tracing technologies through the MIT PACT project.
Repeating the same evaluation process during a global public health emergency proved far from easy. NIST found itself condensing the typical full-year cycle for the evaluation challenge down into just five months starting in April and ending in August. “It was a tight timeline and could have been stressful at times,” Sadjadi says. “Hopefully we were able to make it easier for the participating teams.”
But the sped-up schedule did put pressure on the research teams that signed up. A total of 14 groups hailing from six continents registered at the beginning, including seven teams from academia and seven teams from industry. Just four teams ended up meeting the challenge deadline: Contact-Tracing-Project from the Hong Kong University of Science and Technology, LCD from the National University of Córdoba in Argentina, PathCheck representing the MIT-based PathCheck Foundation, and the MITRE team representing the U.S. nonprofit MITRE Corporation.
The challenge did not specifically test Bluetooth-based app frameworks such as the Google Apple Exposure Notification (GAEN) protocol that are currently used in exposure notification or contact tracing apps. Instead, the challenge focused on evaluating whether teams’ machine learning models could improve on the process of detecting a phone’s proximity based on the combination of Bluetooth signal information and data from other common phone sensors such as accelerometers, gyroscopes, and magnetometers.
To provide the training data necessary for the teams’ machine learning models, MITRE Corporation and MIT Lincoln Laboratory staff members helped collect data from pairs of phones held at certain distances and heights near one another. They also included data from different scenarios such as both people holding the phones in their hands, as well as one or both people having the phones in their pockets. The latter is important given how Bluetooth signals can be weakened or deflected by a number of different materials.
“If you're collecting data for the purpose of training and evaluating automated proximity detection technologies, you need to consider all possible scenarios and phone carriage states that could happen in everyday conditions, whether people are moving around and going shopping, or in nursing homes, or they're sitting in a classroom or they are sitting at their desk at their work organization,” Sadjadi says.
One unexpected hiccup occurred when the original NIST development data set—based on 10 different recording sessions with MIT Lincoln Laboratory researchers holding phone pairs in different positions—led to the classic “overfitting” problem where machine learning performance is tuned too specifically to the conditions in a particular data set. The machine learning models were able to identify specific recording sessions by using air pressure information from the altitude sensors of the iPhones. That gave the models a performance boost in phone proximity detection for that specific training data set, but their performance could fall when faced with new data in real-world situations.
Luckily, one of the teams participating in the challenge reported the issue to NIST when it noticed its machine learning model prioritizing data from the altitude sensors. Once Sadjadi and his colleagues figured out what happened, they enlisted the help of the MITRE Corporation to collect new data based on the same data collection protocol and released the new training data set within a few days.
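One generic way to catch this kind of session leakage is to score a model only on recording sessions it never saw during training. The Python sketch below illustrates that idea under stated assumptions: the function name `leave_one_session_out` and the session labels are hypothetical, and nothing here is drawn from the NIST tooling or from any team's code.

```python
from collections import defaultdict


def leave_one_session_out(session_ids):
    """Yield (held_out_session, train_indices, test_indices) folds.

    session_ids[i] is the recording session that sample i came from.
    Every sample from the held-out session goes into the test fold, so a
    model cannot score well merely by recognizing which session it is in.
    """
    by_session = defaultdict(list)
    for index, session in enumerate(session_ids):
        by_session[session].append(index)

    for held_out in sorted(by_session):
        test_indices = by_session[held_out]
        train_indices = [
            index
            for session, indices in by_session.items()
            if session != held_out
            for index in indices
        ]
        yield held_out, train_indices, test_indices


# Hypothetical example: five samples drawn from three recording sessions.
sessions = ["a", "a", "b", "c", "c"]
for held_out, train, test in leave_one_session_out(sessions):
    print(held_out, train, test)
```

If a model's score drops sharply under a split like this compared with a random split, that gap hints that it is leaning on features that identify the session rather than the distance between phones.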
The team results on the final TC4TL leaderboard reflect the machine learning models’ performances based on the new training data set. But NIST still included a second table below the leaderboard results showing how the overfitted models performed on the original training data set. Such results are presented as a normalized decision cost function (NDCF) that represents proximity detection performance when accounting for the combination of false negatives (failing to detect a nearby phone) and false positives (falsely saying a nearby phone has been detected).
If the machine learning models performed no better than a coin flip on the binary yes-or-no question of whether two phones are too close, their NDCF values would be 1. The fact that most of the models achieved values significantly below 1 is a good sign for the promise of applying machine learning to boost digital contact tracing down the line.
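To make the metric concrete, here is a minimal Python sketch of one common form of a normalized detection cost function. The function name `normalized_dcf` is hypothetical, and the equal error costs and 0.5 prior are assumptions chosen for illustration; they are not confirmed as the exact weights used in the TC4TL evaluation. Under those assumptions the score reduces to the miss rate plus the false alarm rate, so a coin flip lands at about 1.

```python
def normalized_dcf(labels, decisions, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Normalized decision cost for binary "too close" decisions.

    labels[i] is True when the phones really were too close.
    decisions[i] is True when the system said they were too close.
    c_miss, c_fa, and p_target are assumed weights, not official values.
    """
    n_target = sum(1 for y in labels if y)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one positive and one negative label")

    # False negatives: a nearby phone that was not detected.
    misses = sum(1 for y, d in zip(labels, decisions) if y and not d)
    # False positives: a detection when no phone was nearby.
    false_alarms = sum(1 for y, d in zip(labels, decisions) if not y and d)

    p_miss = misses / n_target
    p_fa = false_alarms / n_nontarget

    cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Divide by the cost of the best trivial system, which always answers
    # "yes" or always answers "no", so that such a system scores exactly 1.
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


truth = [True, True, False, False]
print(normalized_dcf(truth, [True, True, False, False]))  # perfect decisions
print(normalized_dcf(truth, [True, True, True, True]))    # always "too close"
```

Changing `c_miss`, `c_fa`, or `p_target` shifts how heavily each kind of error counts, which is the open weighting question the NIST team describes.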
However, it’s still unclear what these normalized DCF values would actually mean for a person’s infection risk in real life. For future evaluations, the NIST team may focus on figuring out the best way to weight both the false positive and false negative error measures. “The next question is ‘What’s the relative importance of false-positives and false-negatives?’” Sadjadi explains. “How can the metric be adjusted to better correlate with realistic conditions?”
It’s also hard to tell which specific machine learning models perform the best for enhancing phone proximity detection. The four teams ended up trying out a variety of different approaches without necessarily finding an optimal method. Still, Sadjadi seemed encouraged by the fact that even these early results suggest that machine learning can improve upon the baseline performance of Bluetooth signal detection alone.
“We hope that in the future the participants use our datasets and our metrics to drive the errors down further,” Sadjadi says. “But these results are far better than random.”
The fact that Bluetooth’s baseline performance in detecting nearby phones still seems quite weak may not bode well for many of the current digital contact tracing efforts using Bluetooth-based apps, especially given the much higher error rates for situations when one or both phones are in someone’s pocket or purse. But Sadjadi suggests that current Bluetooth-based apps could still provide some value for overwhelmed public health systems and humans doing manual contact tracing.
“It seems like we’re not there yet when you consider everyday scenarios and real-life scenarios,” Sadjadi says. “But again, even in case of not so strong performance, it can still be useful, and it can probably still be used to augment manual contact tracing, because as humans we don't remember exactly who we were in contact with or where we were.”
Many future challenges remain before researchers can deliver enhanced Bluetooth-based proximity detection and a possible performance boost from machine learning. For example, Bluetooth-based proximity detection could likely become more accurate if phones spent more time listening for Bluetooth chirps from nearby phones, but tech companies such as Google and Apple have currently limited that listening time period in the interest of preserving phone battery life.
The NIST team is also thinking about how to collect more training data for what comes next beyond the TC4TL Challenge. Some groups such as the MIT Lincoln Laboratory have been testing the use of robots to conduct phone data collection sessions, which could improve the reliability of the reported distances and other factors involved in tests. That may be useful for collecting training data. But Sadjadi believes that it would still be best to use humans in collecting the data used for the test data sets that measure machine learning models’ performances, so that the conditions match real life as closely as possible.
“This is not the first pandemic and it does not seem to be the last one,” Sadjadi says. “And given how important contact tracing is—either manual or digital contact tracing—for this kind of pandemic and health crisis, the next TC4TL challenge cycle is definitely going to be longer.”
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.