Facebook Engineers Crash Data Centers in Real-World Stress Test

“It’s easier to take a data center down than to put it back together,” says Facebook vice president of engineering Jay Parikh. But the company’s software engineers are getting better at the putting-it-back-together part, thanks to a series of regular stress tests conducted on Facebook’s operational network by the company’s disaster special weapons and tactics, or SWAT, team. Parikh described the effort, dubbed “Project Storm,” to the audience of invited engineers at the third annual @Scale conference held in San Jose this week. @Scale brings together engineers who build or maintain systems designed for vast numbers of users, including companies like Google, Airbnb, Dropbox, Spotify, Netflix, and others.

Facebook’s Project Storm originated in the wake of 2012’s Hurricane Sandy, Parikh reported. The superstorm threatened two of Facebook’s data centers, each carrying tens of terabits of traffic. Both got through Sandy unscathed, Parikh said, but watching the storm’s progress led the engineering team to consider exactly what the impact on Facebook’s global services would be if the company did suddenly lose a data center or an entire region. The company assembled a SWAT team comprising the leaders of the various Facebook technology groups, who, in turn, marshaled the entire engineering workforce to figure out the answers.

The group began running a number of tests and fine-tuning mechanisms for shifting traffic should a data center drop from the network, Parikh reported. They created tools and checklists of tasks, both manual and automated, and they set time standards for completing each task. We wanted, Parikh said, “to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas.”
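To make the idea of a timed checklist concrete, here is a minimal sketch in Python. It is not Facebook's tooling, which Parikh did not describe in detail; the task names, the time budgets, and the `DrillTask` and `over_budget` names are all hypothetical, chosen only to illustrate comparing each task's measured duration against its time standard.

```python
from dataclasses import dataclass


@dataclass
class DrillTask:
    """One checklist entry for a hypothetical data center drill."""
    name: str
    automated: bool   # True if a tool runs it, False if a person does
    budget_s: float   # time standard for the task, in seconds
    actual_s: float   # time measured during the drill, in seconds


def over_budget(tasks):
    """Return the names of tasks that took longer than their time standard."""
    return [t.name for t in tasks if t.actual_s > t.budget_s]


# Illustrative values only; none of these figures come from the talk.
tasks = [
    DrillTask("drain traffic from data center", True, 600, 540),
    DrillTask("confirm capacity in remaining regions", False, 300, 420),
    DrillTask("restore traffic", True, 900, 880),
]

print(over_budget(tasks))
```

A report like this, produced after every drill, would show which steps miss their targets and therefore where to focus before the next one.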

In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly.

Other Facebook leaders didn’t think he’d actually do it, Parikh recalls. “I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.

Traffic flow to Facebook’s various systems turns chaotic after the company’s first test of a data center suddenly going offline. Photo: Tekla Perry

That first takedown, which involved virtually the entire engineering team and a lot of people from the rest of the company, turned out to be a bit of a mess, at least from the inside. But users didn’t appear to notice. Parikh presented a chart tracking the traffic loads on various software systems, which should have displayed smooth curves but instead looked chaotic. “If you’re an engineer and see a graph like that,” he said, “you’ve got bad data, your control system is not working right, or you have no idea what you’re doing.”

Regular stress tests have helped Facebook improve the way its overall systems respond to a data center outage. Photo: Tekla Perry

The Project Storm team forged ahead, continuing to hit Facebook’s networks with stress tests—although, Parikh recalls, there never seemed to be a good time to do them. “Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch.” And the switchovers got smoother.

The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says. “You need to push yourself to an uncomfortable place to get better.”

