Q&A: How Google Implements Code Coverage at Massive Scale

An analysis by Google researchers reveals how the company’s engineers manage code coverage across one billion lines of code

Screenshot of line coverage visualization during code review
The line numbers (highlighted with the red rectangle) are colored to visualize code coverage: green if the line is covered, orange if it is not covered, and white if it is not instrumented.
Image: Google

In software development, a common metric called code coverage measures the percentage of a system’s code that is exercised by tests before deployment. Coverage is typically measured automatically by a separate tool, though many coverage tools can also be invoked manually from the command line. The results show exactly which lines of code were executed when the test suite ran, and can reveal which lines may need further testing.
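As a rough illustration (not specific to any of the tools used at Google), here is a minimal sketch of driving such a measurement with Python’s coverage.py; the calculator module is a hypothetical stand-in for the code under test.

```python
# Minimal sketch using coverage.py's Python API (pip install coverage).
# The same measurement is more commonly run from the command line, e.g.:
#   coverage run -m pytest && coverage report -m
import coverage

cov = coverage.Coverage()
cov.start()                      # begin recording which lines execute

import calculator                # hypothetical module under test
calculator.add(2, 3)             # exercise part of its code

cov.stop()
cov.save()
cov.report(show_missing=True)    # per-file percentages plus uncovered line numbers
```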

Ideally, software development teams aim for 100 percent code coverage. In practice, this rarely happens because of the many paths a given block of code can take, or the various edge cases that should (or shouldn’t) be considered based on system requirements.
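For example, in the hypothetical snippet below, the test exercises only the happy path, so a line-coverage tool would report the error-handling line as uncovered even though the test passes.

```python
# Hypothetical example: the out-of-range branch is never executed by the test,
# so the raise line shows up as uncovered in a line-coverage report.

def parse_port(value: str) -> int:
    port = int(value)
    if port < 0 or port > 65535:               # edge case: invalid port number
        raise ValueError("port out of range")  # never reached by the test below
    return port

def test_parse_port():
    assert parse_port("8080") == 8080          # covers the happy path only
```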

Graph illustrating projects actively using coverage automation. More projects at Google have actively incorporated automated code coverage tools in recent years. Illustration: Google

Measuring code coverage has become common practice for software development and testing teams, but the question of whether this practice actually improves code quality is still up for debate.

Some argue that developers might focus on quantity rather than quality, creating tests just to satisfy the code coverage percentage instead of tests that are robust enough to identify high-risk or critical areas. Others raise concerns about its cost-effectiveness—it takes valuable developer time to review the results and doesn’t necessarily improve test quality. 

For a large organization such as Google—with a code base of one billion lines of code receiving tens of thousands of commits per day and supporting seven programming languages—measuring code coverage can be especially difficult.

A recent study led by Google researchers Marko Ivanković and Goran Petrović provides a behind-the-scenes look at the tech giant’s code coverage infrastructure, which consists of four core layers. The bottom layer is a combination of existing code coverage libraries for each programming language, while the middle layers automate and integrate code coverage into the company’s development and build workflows. The top layer visualizes code coverage information using code editors and other custom tools.

As part of the study, Ivanković and Petrović analyzed code coverage adoption rates over a five-year period. They found that despite code coverage not being mandatory at Google, the rate of adoption has grown steadily since 2014. In the first quarter of 2018, more than 90 percent of projects used automated code coverage tools.  

The researchers also surveyed 3,000 randomly chosen Google developers and other employees in non-engineering roles about the usefulness of code coverage, collecting 512 responses. Among the respondents, only 45 percent said they frequently use code coverage when authoring code changes, while 40 percent use it regularly when conducting code reviews.

Graph illustrating self-reported usefulness of changelist coverage. Survey participants at Google rated the usefulness of code coverage when authoring a code change (red), reviewing a code change (blue), and browsing a code change (green). Illustration: Google

Ivanković spoke to IEEE Spectrum about the study and the role code coverage plays in software development and testing.

This interview has been edited and condensed for clarity.

IEEE Spectrum: Why do you think code coverage is important?

Marko Ivanković: A lot of people are probably expecting us to say something along the lines of, “Good coverage reduces [the] number of bugs.” That’s certainly part of it, but one of the more surprising insights [we found] was that even if coverage wasn’t directly helpful as a quality signal, it would still be worth computing.

Coverage might not directly be helpful for humans looking at the code, but it would still be helpful for tools—for example, a tool that analyzes dependencies. So for instance, if code A declares that it depends on code B, but tests for code A never reach code B, then it’s possible that the dependency is not real, and an automated tool can try to remove it to simplify the code base.
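A rough sketch of that idea, with invented data structures that do not reflect Google’s actual tooling: a cleanup tool could flag any declared dependency whose files were never executed while running the depending code’s tests.

```python
# Hypothetical sketch of the dependency-pruning idea described above.

def find_suspect_dependencies(
    declared_deps: set[str],        # dependencies listed in code A's build rule
    covered_files: set[str],        # files executed while running code A's tests
    files_of: dict[str, set[str]],  # maps each dependency to its source files
) -> set[str]:
    """Return declared dependencies whose code was never reached by the tests."""
    suspects = set()
    for dep in declared_deps:
        if not (files_of.get(dep, set()) & covered_files):
            suspects.add(dep)       # no file of this dependency was ever executed
    return suspects

# Example: code_a declares a dependency on code_b, but its tests never touch it.
print(find_suspect_dependencies(
    declared_deps={"code_b"},
    covered_files={"code_a/main.py"},
    files_of={"code_b": {"code_b/lib.py"}},
))  # -> {'code_b'}: a candidate for automated removal (subject to review)
```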

Of course, actual implementation is much more complex than that. We’ve found dozens of such tools that can use coverage information provided by our infrastructure to improve their own functionality. And for many of these use cases, the correlation between code coverage and code quality is not at all important.

IEEE Spectrum: What inspired you to study code coverage at Google?

Ivanković: We were inspired by a problem we faced ourselves. During code reviews, we were spending a lot of time trying to figure out if the tests actually test the code or not. At the time, coverage computation was supported by the build system, but you had to invoke it manually and then manually overlay the coverage results on the code you were reviewing. One day, we just said to ourselves, “There has to be a way to automate this.” After a week, we had the first prototype running. Other engineers saw it and asked if they could get the same. We wanted to make sure we were providing them with the best possible experience, so we started researching the problem.

IEEE Spectrum: What surprised you most about your results?

Ivanković: We were surprised by the number of people who were originally skeptical of coverage but ended up adopting the methodology and ultimately finding it useful. A number of people we surveyed were against coverage on principle, but they still admitted to using it sometimes and finding it useful.

IEEE Spectrum: What’s the biggest challenge you faced with your study, and how did you overcome it?

Ivanković: On the surface, code coverage appears to be a simple concept: A line is either covered by tests or it isn’t. But it turned out to be full of corner cases and unexpected situations when implemented at scale. It took us several years of hard work to fix all failure modes in the infrastructure. 

We hit a similar challenge when we were conducting our study. Most engineers we surveyed had the same overall idea of what coverage is, but when asked for details, their responses differed widely. We had to try several surveys on smaller populations before we got the questions right. 

IEEE Spectrum: What do you see as the strengths of Google’s code coverage infrastructure? What else do you think could be improved?

Ivanković: We worked hard to make sure the infrastructure is resource-efficient and can run at Google’s massive scale. Showing people that it’s possible to do this is probably the biggest contribution [of our study]. 

We designed our infrastructure in a way that makes it easy to experiment, do A/B testing, and evaluate hypotheses. We also export all data in accessible formats so coverage can be visualized, which helps teams keep their code healthy and prepare fix-it events.

When we were surveying engineers, some of them offered suggestions for improvement, a few of which could be interesting to explore. One of the more playful ones was to not show the coverage results if they were too good, so that engineers don’t get overconfident.

IEEE Spectrum: What advice would you give software development and testing teams looking to deploy code coverage or improve their existing code coverage efforts?

Ivanković: I think the most important advice we could give is to focus on their workflow. Don’t just deploy coverage, but make sure you integrate it in the developer workflow at the right place, where the results are most useful. In our experience, code review is the cornerstone of code health.

IEEE Spectrum: What future developments are in store for Google’s code coverage infrastructure?

Ivanković: Currently, we’re looking further into the usage data and developer opinions to better understand how coverage is used. For example, we’re researching how the perceived usefulness differs from actual usefulness. A concrete question we would like to examine is, “Does showing coverage during code review actually speed up the review process?” The results of this research will determine our next infrastructure improvements.
