“It’s easier to take a data center down than to put it back together,” says Facebook vice president of engineering Jay Parikh. But the company’s software engineers are getting better at the putting-it-back-together part, thanks to a series of regular stress tests conducted on Facebook’s operational network by the company’s disaster special weapons and tactics, or SWAT, team. Parikh described the effort, dubbed “Project Storm,” to the audience of invited engineers at the third annual @Scale conference, held in San Jose this week. @Scale brings together engineers who build or maintain systems designed for vast numbers of users, from companies including Google, Airbnb, Dropbox, Spotify, and Netflix.
Facebook’s Project Storm originated in the wake of 2012’s Hurricane Sandy, Parikh reported. The superstorm threatened two of Facebook’s data centers, each carrying tens of terabits of traffic. Both got through Sandy unscathed, Parikh said, but watching the storm’s progress led the engineering team to consider exactly how Facebook’s global services would be affected if the company suddenly lost a data center or an entire region. The company assembled a SWAT team comprising the leaders of the various Facebook technology groups, who, in turn, marshaled the entire engineering workforce to figure out the answers.
The group began running a number of tests and fine-tuning mechanisms for shifting traffic should a data center drop from the network, Parikh reported. They created tools and checklists of tasks, both manual and automated, and set time standards for completing each task. We wanted, Parikh said, “to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas.”
In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly.
Other Facebook leaders didn’t think he’d actually do it, Parikh recalls. “I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.
Traffic flow to Facebook’s various systems turns chaotic after the company’s first test of a data center suddenly going offline. Photo: Tekla Perry
That first takedown, which involved virtually the entire engineering team and a lot of people from the rest of the company, turned out to be a bit of a mess—at least from the inside. But users didn’t appear to notice. Parikh presented a chart tracking the traffic loads on various software systems—something that should have displayed smooth curves but instead showed chaotic spikes. “If you’re an engineer and see a graph like that, you’ve got bad data, your control system is not working right, or you have no idea what you’re doing.”
Regular stress tests have helped Facebook improve the way its overall systems respond to a data center outage. Photo: Tekla Perry
The Project Storm team forged ahead, continuing to hit Facebook’s networks with stress tests—although, Parikh recalls, there never seemed to be a good time to do them. “Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch.” And the switchovers got smoother.
The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says. “You need to push yourself to an uncomfortable place to get better.”
Tekla S. Perry is a senior editor at IEEE Spectrum. Based in Palo Alto, Calif., she's been covering the people, companies, and technology that make Silicon Valley a special place for more than 30 years. An IEEE member, she holds a bachelor's degree in journalism from Michigan State University.