Facebook’s Vanishing Act Explained

IEEE Spectrum: For the Technology Insider

On Monday, Facebook vanished from the Internet, along with the company's other platforms, Instagram and WhatsApp. All of them were back up and functioning by the end of the day, but during the more than five hours they were down, Facebook potentially lost tens of millions of dollars in revenue. Analytics provider Haystack saw a 32 percent increase in developer throughput while Facebook was down, suggesting that developers may actually have gotten more work done than usual during the outage.

But why did Facebook disappear from the Internet to begin with? As it turns out, it began with a small bug that cascaded into bigger problems. And while Facebook's accounting of what went wrong checks out, some missing details raised questions for network experts.

Why did Facebook vanish?

According to Facebook's own explanation of the events, the problems started during a routine bit of maintenance on the company's internal backbone. This backbone is a series of fiber optic cables and data centers built and operated by Facebook to handle both internal communications and external requests. Any time you log onto your Facebook account, browse Instagram, or send a message on WhatsApp, you're making such an external request.

Like any company maintaining a portion of the Internet's infrastructure, Facebook uses software tools to check on the status of its backbone. These tools are relatively simple: One might measure data throughput on a fiber line, for example, or temporarily take down one fiber line to test the redundancy of other lines. "These tools are not big, convoluted systems," says Yiannis Psaras, a researcher at Protocol Labs.

Apparently, during Monday's maintenance, one such tool, instead of taking a single line down, sent out a command to take down every line. According to Psaras, it was as though the tool essentially cut every one of Facebook's fiber lines in half.

The problem compounded from there: Each of Facebook's own servers, unable to communicate with anything else, assumed it was the source of the fault, and therefore each one took itself offline.
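The cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Server` class, the server names, and the rule "withdraw if no peer is reachable" are hypothetical stand-ins, not Facebook's actual health-check logic.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    online: bool = True

    def health_check(self, reachable_peers: int) -> None:
        # Hypothetical rule: a server that cannot reach any peer assumes
        # the fault is its own and takes itself out of service.
        if reachable_peers == 0:
            self.online = False


def run_health_checks(servers, backbone_up):
    """Run one round of checks and return the names still online."""
    for server in servers:
        # With the backbone down, no server can reach any other.
        peers = len(servers) - 1 if backbone_up else 0
        server.health_check(peers)
    return [s.name for s in servers if s.online]


fleet = [Server("dns-a"), Server("dns-b"), Server("dns-c")]
print(run_health_checks(fleet, backbone_up=True))   # all three stay online
print(run_health_checks(fleet, backbone_up=False))  # every server withdraws
```

Each server's decision is reasonable on its own. The trouble comes when every server reaches the same conclusion at the same moment.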

Facebook uses larger data centers to hold all the content on its websites and apps, and smaller servers that handle Domain Name System (DNS) queries. DNS is often referred to as the Internet's phonebook: It's the system that converts human-readable domain names (such as spectrum.ieee.org) to IP addresses, the strings of numbers used to locate and retrieve a website's data.

When Facebook's DNS servers removed themselves from both Facebook's internal backbone and the public-facing Internet, no one could reach anything Facebook-related for the same reason you can't call someone if you don't have their phone number. The DNS queries made by people trying to log onto their accounts all failed because no reachable server could answer them with a valid IP address.
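To make the phonebook analogy concrete, here is a toy lookup in Python. It is a sketch, not a real resolver: the `RECORDS` table, the `.example` names, and the `resolve` function are invented for illustration, and the addresses come from a block reserved for documentation.

```python
from typing import Dict, Optional

# Made-up records for illustration. The 192.0.2.0/24 block is reserved
# for documentation, so these addresses belong to no real service.
RECORDS: Dict[str, str] = {
    "social.example": "192.0.2.10",
    "photos.example": "192.0.2.20",
}


def resolve(name: str, dns_servers_online: bool) -> Optional[str]:
    """Return the IP address for `name`, or None if nobody can answer."""
    if not dns_servers_online:
        # The records still exist, but no server is reachable to serve them.
        return None
    return RECORDS.get(name)


print(resolve("social.example", dns_servers_online=True))   # 192.0.2.10
print(resolve("social.example", dns_servers_online=False))  # None
```

Note that the data itself never disappears in this sketch. The lookup fails only because nothing is left online to answer it.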

As the hours passed, this slowed down the rest of the Internet too. Shiv Panwar, a researcher at NYU Wireless, explains that DNS is hierarchical—if a DNS query runs into a problem, it will check a wider range of servers to see if it can locate the information it needs. It's the equivalent of switching from a local phonebook to a regional one. In other words, people's attempts to log onto Facebook and Instagram affected requests for the rest of the Internet as their queries searched anywhere and everywhere for the information they were after.
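One rough way to see why failing lookups add load: a lookup that succeeds stops after its first answer, while one that fails keeps trying other servers. The Python sketch below counts the queries a single lookup sends. The server names and the retry count are assumptions chosen for illustration, not measurements of real resolver behavior.

```python
from typing import Callable, List, Tuple


def lookup(nameservers: List[str],
           attempts_per_server: int,
           answers: Callable[[str], bool]) -> Tuple[bool, int]:
    """Try each nameserver in turn; return (resolved, queries_sent)."""
    sent = 0
    for _ in range(attempts_per_server):
        for server in nameservers:
            sent += 1
            if answers(server):
                return True, sent  # stop at the first answer
    return False, sent  # every attempt went unanswered


# Hypothetical nameservers for one domain.
servers = ["ns1", "ns2", "ns3", "ns4"]

print(lookup(servers, 3, answers=lambda s: True))   # (True, 1)
print(lookup(servers, 3, answers=lambda s: False))  # (False, 12)
```

Under these assumptions, one failed lookup costs twelve times as many queries as a successful one, and that is before apps and users start retrying on their own.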

Does Facebook's explanation make sense?

Yes, although the fact that a tool was the initial culprit surprised both Psaras and Panwar. Recall that the original problem was a tool sending out a command that managed to sever all of the routes between Facebook's data centers. "Why would the tool have this functionality, even as backup?" says Psaras.

Psaras explains that because such software tools are designed to be simple, each testing just one aspect of a network, it's a bit odd that the tool was able to cause a global screwup. It is possible, however, that Facebook does have and use a tool that could take down its internal backbone because of a bug.

Panwar suggests that the tool may have been designed to take down a particular route to test how the rest of the backbone picked up the slack. In other words, it may have been a tool designed to test the network's redundancy. It's easy to imagine such a tool, one designed to take down a route, check the rest of the network, and bring the route back up before moving on to another, potentially bugging out and taking down every route instead.
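The kind of tool Panwar imagines can be sketched as follows. Everything here is hypothetical: the three-site topology, the function names, and the shape of the bug are assumptions made to illustrate the idea, since Facebook has not published the tool itself.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search: can the first node reach every other node?"""
    nodes = list(nodes)
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for a, b in links:
            neighbor = b if a == current else a if b == current else None
            if neighbor is not None and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def redundancy_test(nodes, links):
    """Take each link down in turn and check that the rest still connects."""
    results = {}
    for link in links:
        remaining = [other for other in links if other != link]  # one link down
        results[link] = is_connected(nodes, remaining)
        # The link is restored implicitly: `links` itself is never modified.
    return results


def buggy_redundancy_test(nodes, links):
    """The imagined failure mode: the command hits every link at once."""
    remaining = []  # all links down
    return is_connected(nodes, remaining)


# Hypothetical backbone: three data centers joined in a triangle.
NODES = ["dc1", "dc2", "dc3"]
LINKS = [("dc1", "dc2"), ("dc2", "dc3"), ("dc1", "dc3")]

print(redundancy_test(NODES, LINKS))        # every single-link test passes
print(buggy_redundancy_test(NODES, LINKS))  # False
```

The two functions differ by a single line, which is roughly why a small bug in a simple tool is a plausible culprit.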

Could Facebook vanish again?

The answer is never "never," but it's unlikely. I've mentioned redundancy a few times—Facebook's backbone, like the rest of the Internet, has redundant routes to get from one location to another. Redundancy gives the Internet resiliency. It's not enough for one route to go down—every route has to fail to totally disrupt traffic.

The Internet is designed to survive "single-point failures." These are problems like a chip going bad, a link going down, a backhoe ripping a fiber line underneath a construction site, or someone pulling a plug in a data center. In fact, that's just the sort of thing Facebook's tool would have been testing for, if it was in fact supposed to be taking down one route to check the traffic load on the others.

Multi-point failures, on the other hand, are much rarer, because they're harder to pull off, either accidentally or intentionally. Of course, the fact that it did happen to Facebook is enough evidence that it's possible, if not probable.
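A back-of-the-envelope calculation shows both why multi-point failures are rare and how a single bad command changes the odds. The probabilities below are illustrative assumptions, not measured failure rates, and the function name is made up for this sketch.

```python
def p_total_outage(p_route: float, routes: int, p_bad_command: float = 0.0) -> float:
    """Probability that every route is down at the same time.

    Assumes routes fail independently with probability `p_route`, and that
    a single faulty command (probability `p_bad_command`) downs all of them.
    """
    p_independent = p_route ** routes
    # An outage happens if the bad command fires OR all routes fail on their own.
    return p_bad_command + (1.0 - p_bad_command) * p_independent


independent_only = p_total_outage(p_route=0.01, routes=4)
with_common_cause = p_total_outage(p_route=0.01, routes=4, p_bad_command=0.001)

print(f"{independent_only:.2e}")   # about 1 in 100 million
print(f"{with_common_cause:.2e}")  # about 1 in 1,000
```

With these made-up numbers, four redundant routes make an accidental total outage vanishingly unlikely, but one command that can reach every route at once dominates the risk. Redundancy protects against independent failures, not against a shared cause.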

Rest assured, however: Now that you're done reading this, you can log into Facebook or Instagram and know that they'll (almost definitely) load.
