On New Year’s Eve, millions of people will use Facebook’s Messenger app to wish friends and family a “Happy New Year!” If everything goes smoothly, those messages will reach recipients in fewer than 100 milliseconds, and life will go on. But if the service stalls or fails, a small team of software engineers based in the company’s New York City office will have to answer for it.
Isaac Ahdout, engineering manager, and Thomas Georgiou, software engineer, are both on that team. They’ve tested and tweaked the app throughout the year and will soon face their biggest annual performance exam. Messenger’s 1.3 billion monthly active users send more messages on New Year’s Eve than on any other day of the year. Many hit “send” (represented as a blue arrow in the app) immediately after the clock strikes midnight in their respective time zones.
“There’s like this firehose you can’t stop, of deliveries you have to make,” says Georgiou. “We have to keep up. Otherwise, you end up in a bad situation.”
It’s a problem familiar to anyone who works on networks or services that see a dramatic spike in use at a particular time of day or year. U.S. telecommunications companies frequently install new base stations ahead of Super Bowls, state fairs, and presidential inaugurations for similar reasons.
For Facebook’s Messenger team, the challenge is slightly more complicated than shuffling a simple message from one user to another. Facebook allows people to set up large group chats, and shows senders a receipt every time a message is delivered, sent, or read. These features compound the total number of messages that must be distributed across the service.
Users also send and receive a higher percentage of photos and videos as they ring in the new year, compared to an average day. And people often try to resend messages that don’t appear to make it through right away, which piles on more requests.
Or, as Ahdout puts it, “once you start falling behind, you fall behind more.”
“The biggest thing we worry about is: How do you prevent that cascading failure from happening?” adds Georgiou.
One way is to perform extensive load testing ahead of time, to simulate the volume of messages that Facebook expects on New Year’s Eve based on activity in previous years. (The company declined to share its forecasts, and would not say how many messages were sent in previous years.) Load testing allows the team to validate how many messages a given server can handle before the team must shift traffic over to other servers in the network.
During the last New Year’s Eve, for example, one data center struggled with the volume of incoming messages, so the team directed traffic away from that center to another one. Following that incident, the group built tools to allow them to make those kinds of changes more easily this year.
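The kind of traffic shift described above can be sketched in a few lines. This is only an illustration, not Facebook's tooling: the function name `shift_overflow`, the data center labels, and every number are hypothetical, and the per-center capacities stand in for limits that load testing would establish.

```python
def shift_overflow(load, capacity):
    """Move traffic above each data center's capacity to centers with headroom.

    `load` and `capacity` map a data center name to messages per second.
    Both the names and the numbers are hypothetical, for illustration only.
    """
    new_load = dict(load)
    for dc in load:
        overflow = new_load[dc] - capacity[dc]
        if overflow <= 0:
            continue  # this center is within its tested capacity
        for other in load:
            if other == dc or overflow <= 0:
                continue
            headroom = capacity[other] - new_load[other]
            if headroom <= 0:
                continue  # nothing to spare here
            moved = min(overflow, headroom)
            new_load[other] += moved
            new_load[dc] -= moved
            overflow -= moved
    return new_load


# One center is 20 units over its limit; the excess moves to the first
# center that has room.
rebalanced = shift_overflow(
    load={"a": 120, "b": 50, "c": 90},
    capacity={"a": 100, "b": 100, "c": 100},
)
```

If no center has headroom, the overflow simply stays where it is, which is the point at which a real system would need other measures.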
Facebook’s Messenger infrastructure team gathers for a photo. Photo: Facebook
In addition to shifting loads, the Messenger team has developed other levers that it can pull “if things get really bad,” says Ahdout. Every new message sent to a server goes into a queue as part of a service called Iris. There, each message is assigned a timeout, a period of time after which it drops out of the queue to make room for new messages. During a high-volume event, this allows the team to quickly discard certain types of messages, such as read receipts, and focus its resources on delivering the ones that users have composed.
“We set up our systems so that if it comes to that, they start shedding the lowest-priority traffic,” says Ahdout. “So if it came to it, Iris would rather deliver a message and drop the read receipt, rather than drop the message and deliver the read receipt.”
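A queue that sheds low-priority traffic this way might look roughly like the sketch below. It is not how Iris is implemented; the class `SheddingQueue`, the priority table, and the timeout values are all assumptions made for illustration. Each item gets a deadline when it is queued, user-composed messages are served before receipts, and anything past its deadline is dropped rather than delivered.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels: a lower number is more important.
PRIORITY = {"message": 0, "delivery_receipt": 1, "read_receipt": 2}


@dataclass(order=True)
class QueuedItem:
    priority: int
    seq: int  # tie-breaker that keeps arrival order within a priority
    kind: str = field(compare=False)
    payload: str = field(compare=False)
    deadline: float = field(compare=False)


class SheddingQueue:
    def __init__(self, timeouts):
        # Seconds each kind of item may wait before it is discarded.
        self.timeouts = timeouts
        self._heap = []
        self._seq = itertools.count()

    def put(self, kind, payload, now):
        deadline = now + self.timeouts[kind]
        item = QueuedItem(PRIORITY[kind], next(self._seq), kind, payload, deadline)
        heapq.heappush(self._heap, item)

    def drain(self, now, capacity):
        """Deliver up to `capacity` unexpired items, most important first."""
        delivered, dropped = [], []
        while self._heap and len(delivered) < capacity:
            item = heapq.heappop(self._heap)
            if item.deadline <= now:
                dropped.append(item.payload)  # timed out: shed it
            else:
                delivered.append(item.payload)
        return delivered, dropped


# Under load, a short timeout on read receipts means they expire first.
q = SheddingQueue({"message": 30.0, "delivery_receipt": 5.0, "read_receipt": 1.0})
q.put("read_receipt", "r1", now=0.0)
q.put("message", "m1", now=0.0)
q.put("message", "m2", now=0.5)
delivered, dropped = q.drain(now=2.0, capacity=3)
```

In this sketch, shortening a timeout is the lever: the tighter the deadline on receipts, the sooner they make way for composed messages when the queue backs up.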
Georgiou says the group can also sacrifice the accuracy of the green dot displayed in the Messenger app that indicates a friend is currently online. Slowing the frequency at which the dot is updated can relieve network congestion. Or the team could instruct the system to temporarily delay certain functions (such as deleting information about old messages) for a few hours, freeing up the CPUs that would ordinarily perform those tasks to process more messages in the moment.
All of these options fall under the notion of “graceful degradation,” says Ahdout. “Rather than having your service dying on the floor and no one using it, you make it a little less awesome and people can still use it.” Fortunately, the Messenger team didn’t have to resort to any of these measures last year.
Aside from those efforts, Messenger’s engineers also spend a lot of time on efficiency projects designed to make the most of the CPUs and memory within each server. Ahead of New Year’s Eve 2018, for example, the team added a scheduler, which is a program that allows the system to “batch” similar messages together.
“You can imagine that our servers are getting many requests concurrently,” explains Ahdout. “You can bundle some of those together into a single large request before you send it downstream. Doing that, you reduce the computational load on downstream systems.”
Batches are formed based on a principle called affinity, which can be derived from a variety of characteristics. For example, two messages may have higher affinity if they are traveling to the same recipient or require similar resources from the back end. As traffic increases, the Messenger team can have the system batch more aggressively. Doing so increases latency (a message’s round-trip delay) by a few milliseconds, but makes it more likely that all messages will get through.
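As a rough illustration of affinity-based batching, the sketch below groups pending requests by recipient, one of the affinity signals mentioned above. The function `batch_by_affinity` and its `max_batch_size` parameter are hypothetical, not part of Facebook's scheduler; raising `max_batch_size` stands in for batching "more aggressively."

```python
from collections import defaultdict


def batch_by_affinity(requests, max_batch_size):
    """Group (recipient, body) requests by recipient.

    Groups larger than `max_batch_size` are split into several batches.
    Using the recipient as the affinity key is an assumption for this sketch.
    """
    groups = defaultdict(list)
    for recipient, body in requests:
        groups[recipient].append(body)

    batches = []
    for recipient, bodies in groups.items():
        for start in range(0, len(bodies), max_batch_size):
            batches.append((recipient, bodies[start:start + max_batch_size]))
    return batches


pending = [("ann", "a1"), ("bob", "b1"), ("ann", "a2"), ("ann", "a3")]
small_batches = batch_by_affinity(pending, max_batch_size=2)  # 3 downstream requests
large_batches = batch_by_affinity(pending, max_batch_size=4)  # 2 downstream requests
```

Four incoming requests become three downstream requests with the smaller limit and two with the larger one. The trade-off is the one the article describes: a request may wait slightly longer for its batch to fill, in exchange for less work downstream.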
This year for New Year’s Eve, neither Ahdout nor Georgiou will be on duty as midnight approaches in Asia, when the service sees its largest spike in messages, but Ahdout says he will stay close to his laptop, just in case. “Basically, a lot of this work never really sees the light of day, in the sense that things go well, or if they don’t, we handle them so gracefully that users don’t even know what happened,” he says.
“It’s sort of been a while since there was a major problem,” he adds. Fingers crossed.