Hey, Data Scientists: Show Your Machine-Learning Work

In the last two years, the U.S. Food and Drug Administration has approved several machine-learning models to accomplish tasks such as classifying skin cancer and detecting pulmonary embolisms. But for the companies that built those models, what happens if the data scientist who wrote the algorithms leaves the organization?

In many businesses, an individual or a small group of data scientists is responsible for building essential machine-learning models. Historically, they have developed these models on their own laptops through trial and error, and passed them along to production once they worked. But in that transfer, the data scientist might not think to pass along all the information about the model’s development. And if the data scientist leaves, that information is lost for good.

That potential loss of information is why experts in data science are calling for machine learning to become a formal, documented process overseen by more people inside an organization.

Companies need to think about what could happen if their data scientists take new jobs, or if a government organization or an important customer asks to see an audit of the algorithm to ensure it is fair and accurate. Not knowing what data was used to train the model and how the data was weighted could lead to a loss of business, bad press, and perhaps regulatory scrutiny, if the model turns out to be biased.

David Aronchick, the head of open-source machine-learning strategy at Microsoft Azure, says companies are realizing that they must run their machine-learning efforts the same way they run their software-development practices. That means encouraging documentation and codevelopment as much as possible.

Microsoft has some ideas about what the documentation process should look like. The process starts with the researcher structuring and organizing the raw data and annotating it appropriately. Not having a documented process at this stage could lead to poorly annotated data that has biases associated with it or is unrelated to the problem the business wants to solve.
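
As a minimal sketch of what documenting this first stage might look like, the Python snippet below records where a data set came from, how it was annotated, and how its labels are distributed. The names DatasetRecord and describe_dataset, the field layout, and the sample rows are all hypothetical; they are not drawn from Microsoft's process.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical record describing one annotated data set."""
    name: str
    source: str
    annotation_guidelines: str
    num_examples: int
    label_counts: dict
    sha256: str


def describe_dataset(name, source, annotation_guidelines, rows):
    """Summarize a list of (example, label) pairs for later audits."""
    # Hash the serialized rows so a reviewer can confirm which data was used.
    payload = json.dumps(rows).encode("utf-8")
    return DatasetRecord(
        name=name,
        source=source,
        annotation_guidelines=annotation_guidelines,
        num_examples=len(rows),
        # A skewed label count is an early, visible hint of possible bias.
        label_counts=dict(Counter(label for _, label in rows)),
        sha256=hashlib.sha256(payload).hexdigest(),
    )


# Illustrative rows only; not real clinical data.
sample_rows = [
    ("image_001", "benign"),
    ("image_002", "malignant"),
    ("image_003", "benign"),
]
sample_record = describe_dataset(
    name="skin-lesions-demo",
    source="hypothetical internal archive",
    annotation_guidelines="two reviewers per image",
    rows=sample_rows,
)
```

A record like this can be stored next to the data itself, so a later reviewer does not have to rely on the original author's memory.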

Next, during training, a researcher feeds the data to a neural network and tweaks how it weighs various factors to get the desired result. Typically, the researcher is still working alone at this point, but other people should get involved to see how the model is being developed—just in case questions come up later during a compliance review or even a lawsuit.

A neural network is a black box when it comes to understanding how it makes its decisions, but the data, the number of layers, and how the network weights different parameters shouldn’t be mysterious. The researchers should be able to tell at a glance how the data was structured and weighted.
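
One way to make those details available at a glance is to save a small, structured record alongside every trained model. The Python sketch below is an illustration under assumptions: TrainingRecord, save_record, load_record, and every field name are hypothetical, and a real team would likely rely on its own experiment-tracking tooling.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class TrainingRecord:
    """Hypothetical summary of how one model was trained."""
    model_name: str
    dataset_sha256: str      # ties the model to the exact data used
    layer_sizes: list        # for example [128, 64, 2]
    hyperparameters: dict    # learning rate, epochs, and so on
    class_weights: dict      # how each class was weighted in training
    author: str
    reviewers: list = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def save_record(record, path):
    """Write the record as readable JSON next to the model artifact."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(record), f, indent=2, sort_keys=True)


def load_record(path):
    """Read a record back, for example during a compliance review."""
    with open(path, encoding="utf-8") as f:
        return TrainingRecord(**json.load(f))
```

Because the record is plain JSON, it can be versioned with the code and read by people who never touched the training script.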

It’s also at this point that good documentation can help make a model more flexible for future use. For example, a shopping site that built a model specifically for Christmas spending patterns can’t simply apply that same model to Valentine’s Day spending. Without good documentation, a data scientist would have to essentially rebuild the model, rather than going back and tweaking a few parameters to adjust it for a new holiday.
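
To illustrate the "tweak a few parameters" idea, the Python sketch below keeps the settings that are specific to a holiday in one documented configuration object. SeasonalConfig, its fields, and all the values shown are hypothetical; the point is only that a documented configuration can be copied and adjusted rather than rediscovered.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class SeasonalConfig:
    """Hypothetical settings for one seasonal spending model."""
    holiday: str
    window_start: date       # first day of training data to include
    window_end: date         # last day of training data to include
    learning_rate: float
    feature_columns: tuple   # which inputs the model reads


# The documented configuration for the original Christmas model.
christmas = SeasonalConfig(
    holiday="christmas",
    window_start=date(2019, 11, 25),
    window_end=date(2019, 12, 24),
    learning_rate=0.001,
    feature_columns=("basket_value", "gift_wrap", "days_to_holiday"),
)

# Reuse it for a new holiday by changing only what differs.
valentines = replace(
    christmas,
    holiday="valentines_day",
    window_start=date(2020, 1, 15),
    window_end=date(2020, 2, 13),
)
```

The frozen dataclass keeps the original Christmas settings intact, so both configurations remain available for comparison.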

The last step in the process is actually deploying the model. Historically, only at this point would other people get involved and acquaint themselves with the data scientist’s hard work. Without good documentation, they’re sure to get headaches trying to make sense of it. But now that data is so essential to so many businesses—not to mention the need to adapt quickly—it’s time for companies to build machine-learning processes that rival the quality of their software-development processes.

This article appears in the December 2019 print issue as “Show Your Machine-Learning Work.”

Documenting software development is standard practice—the same should hold for algorithm design
