“Those of us in machine learning are really good at doing well on a test set,” says machine learning pioneer Andrew Ng, “but unfortunately deploying a system takes more than doing well on a test set.”
Speaking via Zoom in a Q&A session hosted by DeepLearning.AI and Stanford HAI, Ng was responding to a question about why machine learning models that are trained to make medical decisions, and that perform at nearly the same level as human experts, are not in clinical use. Ng brought up a case in which Stanford researchers were able to quickly develop an algorithm to diagnose pneumonia from chest X-rays, one that, when tested, did better than human radiologists. (Ng, who co-founded Google Brain and Coursera, is currently a professor at Stanford University.)
Turning a research paper into something useful in a clinical setting is challenging, he indicated.
“It turns out,” Ng said, “that when we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions.”
But, he said, “It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of [the] AI system to degrade significantly. In contrast, any human radiologist can walk down the street to the older hospital and do just fine.
“So even though at a moment in time, on a specific data set, we can show this works, the clinical reality is that these models still need a lot of work to reach production.”
This gap between research and practice is not unique to medicine, Ng pointed out, but exists throughout the machine learning world.
“All of AI, not just healthcare, has a proof-of-concept-to-production gap,” he says. “The full cycle of a machine learning project is not just modeling. It is finding the right data, deploying it, monitoring it, feeding data back [into the model], showing safety—doing all the things that need to be done [for a model] to be deployed. [That goes] beyond doing well on the test set, which fortunately or unfortunately is what we in machine learning are great at.”
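The monitoring step Ng lists can be illustrated with a small data-drift check. The sketch below is not from Ng's talk or from the Stanford work. It is a minimal Python illustration, assuming each scan is reduced to a single summary number (here, a hypothetical mean brightness) and that a two-sample Kolmogorov-Smirnov statistic with an arbitrary threshold of 0.3 is an acceptable alarm. All values, names, and the threshold are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, incoming):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    ref = sorted(reference)
    inc = sorted(incoming)
    largest_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in ref + inc:
        ref_cdf = bisect_right(ref, value) / len(ref)
        inc_cdf = bisect_right(inc, value) / len(inc)
        largest_gap = max(largest_gap, abs(ref_cdf - inc_cdf))
    return largest_gap


def drift_detected(reference, incoming, threshold=0.3):
    """Flag drift when the KS statistic exceeds the threshold.
    The 0.3 default is an illustrative assumption, not a clinical standard."""
    return ks_statistic(reference, incoming) > threshold


# Hypothetical mean-brightness values per scan (invented numbers).
training_site = [0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50, 0.47]
same_site_later = [0.50, 0.51, 0.49, 0.52, 0.48, 0.50]
older_hospital = [0.61, 0.63, 0.60, 0.64, 0.62, 0.59]

print("same site drifted:", drift_detected(training_site, same_site_later))
print("older hospital drifted:", drift_detected(training_site, older_hospital))
```

A check like this only signals that the new hospital's data looks different from the training data. It does not show whether the model's diagnoses have actually degraded, which would still require labeled cases from the new site.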
Tekla S. Perry is a senior editor at IEEE Spectrum. Based in Palo Alto, Calif., she's been covering the people, companies, and technology that make Silicon Valley a special place for more than 30 years. An IEEE member, she holds a bachelor's degree in journalism from Michigan State University.