Facebook Engineers Crash Data Centers in Real-World Stress Test

After dodging disasters from Hurricane Sandy, Facebook instigates its own outages as part of Project Storm

Facebook's Vice President of Engineering Jay Parikh addresses the engineers attending the third annual @Scale conference
Photo: Facebook

“It’s easier to take a data center down than to put it back together,” says Facebook vice president of engineering Jay Parikh. But the company’s software engineers are getting better at the putting-it-back-together part, thanks to a series of regular stress tests conducted on Facebook’s operational network by the company’s disaster special weapons and tactics, or SWAT, team. Parikh described the effort, dubbed “Project Storm,” to the audience of invited engineers at the third annual @Scale conference held in San Jose this week. @Scale brings together engineers who build or maintain systems designed for vast numbers of users, from companies including Google, Airbnb, Dropbox, Spotify, and Netflix.

Facebook’s Project Storm originated in the wake of 2012’s Hurricane Sandy, Parikh reported. The superstorm threatened two of Facebook’s data centers, each carrying tens of terabits of traffic. Both got through Sandy unscathed, Parikh said, but watching the storm’s progress led the engineering team to consider what exactly would be the impact on Facebook’s global services if the company did indeed suddenly lose a data center or an entire region. The company assembled a SWAT team comprising the leaders of the various Facebook technology groups, who, in turn, marshaled the entire engineering workforce to figure out the answers.

The group began running a number of tests and fine-tuning mechanisms for shifting traffic should a data center drop from the network, Parikh reported. They created tools and checklists of tasks, both manual and automated, and set time standards for completing each task. The team wanted, Parikh said, “to run like a pit stop at a race; to get everything fixed on the car in the shortest period of time, realizing, however, that this is like taking apart an aircraft carrier and putting it back together in a few hours, not just taking apart a toy that I got for Christmas.”
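To make the idea of shifting traffic away from a lost data center concrete, here is a minimal sketch. It is not Facebook's actual mechanism, which the talk did not detail. It assumes a simple policy in which the failed site's load is spread across the surviving sites in proportion to their spare capacity. The function name `shift_traffic`, the site names, and the numbers are all hypothetical, and the units are arbitrary.

```python
from typing import Dict


def shift_traffic(
    load: Dict[str, float],
    capacity: Dict[str, float],
    failed: str,
) -> Dict[str, float]:
    """Return the new per-site load after `failed` drops off the network.

    Assumed policy (illustrative only): the displaced load is divided among
    the surviving sites in proportion to each site's spare capacity.
    """
    # Spare capacity (headroom) of every site that is still online.
    headroom = {dc: capacity[dc] - load[dc] for dc in load if dc != failed}
    total_headroom = sum(headroom.values())
    displaced = load[failed]

    if displaced > total_headroom:
        raise RuntimeError("not enough spare capacity to absorb the failed site")
    if total_headroom == 0:
        # Nothing to move and nowhere to move it: loads stay as they are.
        return {dc: load[dc] for dc in headroom}

    return {
        dc: load[dc] + displaced * spare / total_headroom
        for dc, spare in headroom.items()
    }


# Hypothetical three-site example.
load = {"east": 30.0, "west": 20.0, "europe": 10.0}
capacity = {"east": 50.0, "west": 50.0, "europe": 30.0}
after = shift_traffic(load, capacity, failed="east")
```

In this example the two surviving sites have 30 and 20 units of headroom, so "west" absorbs 18 of the displaced 30 units and "europe" absorbs 12. A real system would also have to account for latency, cache warm-up, and dependencies between services, which this sketch ignores.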

In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly. 

Other Facebook leaders didn’t think he’d actually do it, Parikh recalls. “I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’” if it works.

Traffic flow to Facebook’s various systems turns chaotic after the company’s first test of a data center suddenly going offline. Photo: Tekla Perry

That first takedown, which involved virtually the entire engineering team and a lot of people from the rest of the company, turned out to be a bit of a mess—at least from the inside. But users didn’t appear to notice. Parikh presented a chart tracking the traffic loads on various software systems—something that should have displayed smooth curves. “If you’re an engineer and see a graph like that, you’ve got bad data, your control system is not working right, or you have no idea what you’re doing.”

Regular stress tests have helped Facebook improve the way its overall systems respond to a data center outage. Photo: Tekla Perry

The Project Storm team forged ahead, continuing to hit Facebook’s networks with stress tests—although, Parikh recalls, there never seemed to be a good time to do them. “Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch.” And the switchovers got smoother.

The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says. “You need to push yourself to an uncomfortable place to get better.”
