In the last two years, the U.S. Food and Drug Administration has approved several machine-learning models to accomplish tasks such as classifying skin cancer and detecting pulmonary embolisms. But for the companies that built those models, what happens if the data scientist who wrote the algorithms leaves the organization?
In many businesses, an individual data scientist or a small team is responsible for building essential machine-learning models. Historically, they have developed these models on their own laptops through trial and error, passing a model along for production once it works. But in that handoff, the data scientist might not think to pass along all the information about the model’s development. And if the data scientist leaves, that information is lost for good.
That potential loss of information is why experts in data science are calling for machine learning to become a formal, documented process overseen by more people inside an organization.
Companies need to think about what could happen if their data scientists take new jobs, or if a government organization or an important customer asks to see an audit of the algorithm to ensure it is fair and accurate. Not knowing what data was used to train the model and how that data was weighted could lead to a loss of business, bad press, and perhaps regulatory scrutiny if the model turns out to be biased.
David Aronchick, the head of open-source machine-learning strategy at Microsoft Azure, says companies are realizing that they must run their machine-learning efforts the same way they run their software-development practices. That means encouraging documentation and codevelopment as much as possible.
Microsoft has some ideas about what the documentation process should look like. The process starts with the researcher structuring and organizing the raw data and annotating it appropriately. Not having a documented process at this stage could lead to poorly annotated data that has biases associated with it or is unrelated to the problem the business wants to solve.
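One lightweight way to formalize that first stage is to attach a short "datasheet" to every training set before any model sees it. The sketch below is illustrative only: the field names and the hypothetical dataset are assumptions, not a formal standard or Microsoft's actual process.

```python
# A minimal, illustrative datasheet for a training set. The fields are
# assumptions about what an auditor might ask for, not a formal standard.
dataset_record = {
    "name": "skin_lesion_images_v2",        # hypothetical dataset
    "source": "partner clinics, 2017-2019",
    "label_scheme": {"0": "benign", "1": "malignant"},
    "annotators_per_image": 3,              # independent labels per image
    "known_gaps": ["few images of darker skin tones"],
}

def check_record(record, required=("name", "source", "label_scheme", "known_gaps")):
    """Return the documentation fields still missing before training begins."""
    return [field for field in required if field not in record]

missing = check_record(dataset_record)
# An empty list means the datasheet covers every required field.
```

Even a checklist this simple forces the question "where did this data come from, and what does it miss?" to be answered in writing rather than held in one person's head.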
Next, during training, a researcher feeds the data to a neural network and tweaks how it weighs various factors to get the desired result. Typically, the researcher is still working alone at this point, but other people should get involved to see how the model is being developed—just in case questions come up later during a compliance review or even a lawsuit.
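The training stage can be documented the same way: every run's configuration and results get appended to a shared log that colleagues and auditors can read later. This is a minimal sketch using only the Python standard library; the function name, the log format, and the example fields are all hypothetical.

```python
import json
from datetime import datetime, timezone

def record_training_run(config, metrics, path="run_log.json"):
    """Append one training run's configuration and results to a JSON log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": config,    # data source, layer sizes, learning rate, etc.
        "metrics": metrics,  # accuracy, loss, or other measured outcomes
    }
    try:
        with open(path) as f:
            log = json.load(f)
    except FileNotFoundError:
        log = []  # first run: start a fresh log
    log.append(entry)
    with open(path, "w") as f:
        json.dump(log, f, indent=2)
    return entry

# Hypothetical run: the field names are illustrative, not a standard schema.
entry = record_training_run(
    config={"dataset": "claims_2019.csv", "layers": [64, 32], "learning_rate": 0.001},
    metrics={"val_accuracy": 0.91},
)
```

In practice a dedicated experiment tracker would do this job, but even a plain log like this leaves a trail a reviewer can follow if questions come up in a compliance review.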
A neural network is a black box when it comes to understanding how it makes its decisions, but the data, the number of layers, and how the network weights different parameters shouldn’t be mysterious. The researchers should be able to tell at a glance how the data was structured and weighted.
It’s also at this point that good documentation can make a model more flexible for future use. For example, a shopping site’s model trained specifically on Christmas spending patterns can’t simply be repurposed for Valentine’s Day spending. Without good documentation, a data scientist would have to rebuild the model from scratch, rather than going back and tweaking a few parameters to adjust it for the new holiday.
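The holiday example above comes down to making the seasonal assumptions explicit parameters rather than values buried in code. The sketch below assumes a hypothetical `SeasonalConfig` with made-up date windows and weights, purely to show the idea.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical configuration: writing the tunable assumptions down up front
# is what lets a later data scientist adjust the model instead of rebuilding it.
@dataclass
class SeasonalConfig:
    holiday: str
    window_start: date    # first day of spending data to include
    window_end: date      # last day of spending data to include
    recency_weight: float # how heavily to favor recent purchases

# Original model, tuned for Christmas spending (dates and weight are illustrative).
christmas = SeasonalConfig("christmas", date(2019, 11, 29), date(2019, 12, 24), 0.8)

# Adapting to a new holiday becomes a parameter change, not a rebuild.
valentines = SeasonalConfig("valentines", date(2020, 2, 1), date(2020, 2, 14), 0.6)
```

Without a documented structure like this, the equivalent knobs exist only in the original author's memory.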
The last step in the process is actually deploying the model. Historically, only at this point would other people get involved and acquaint themselves with the data scientist’s hard work. Without good documentation, they’re sure to get headaches trying to make sense of it. But now that data is so essential to so many businesses—not to mention the need to adapt quickly—it’s time for companies to build machine-learning processes that rival the quality of their software-development processes.
This article appears in the December 2019 print issue as “Show Your Machine-Learning Work.”