Under the Hood at Google and Facebook
A peek at the data centers, servers, and software that keep us feeling connected
It was 1999, and Google founders Larry Page and Sergey Brin, fresh from incorporating their oddly named company, needed servers—lots of them. So they went shopping for PC motherboards, disk drives, and power supplies. Not long before, though, they'd been cash-strapped grad students, so to save money, they kludged together four motherboards to one power supply, mounting them on cookie trays, which they lined with cork to avoid shorting anything out. Then they crammed these ugly yet functional servers into racks with network cables dangling everywhere.
It goes without saying that Google's technical infrastructure has improved since those slapdash early days. But Google is loath to reveal much about its back-end operations. In interviews with IEEE Spectrum, the company's engineers would often preface their purposely vague answers with "We don't want to talk about specifics" or "We can't talk about it a lot." Google has even attempted to keep secret the locations of many of its three dozen or so data centers, which include 20 scattered across the United States. Of course, that is absurdly hard to do with multimillion-dollar warehouselike facilities that must be approved by local officials, checked by government inspectors, and constructed in plain sight. So considerable information about Google's data infrastructure can now be found by just, well, googling Google.
Facebook, too, quickly catapulted from a student project to a dominant player on the Web. And Facebook's engineers have also had to pedal hard to keep up with the site's speedy rise in popularity. Indeed, these two companies have in many respects led strangely parallel lives—each of them opened its first data center in its seventh year of operation, for example. But Google and Facebook differ in fundamental ways as well, particularly in how they've produced the software that creates all the things we've come to expect from them, and also in how open they are about their technical operations.
Of course, those operations have had to grow in size and complexity to match the exponential rise in demand. Now, on any given day, Google's search engine fields more than a billion queries, and more than a quarter billion people visit the Facebook site. These companies have both had to mount massive engineering efforts to handle all that traffic, and the results of those labors have been impressive indeed.
Google's data center in The Dalles, Ore., completed in 2006, is one of the first that the company built rather than leased. At the time, Google was so hush-hush about this project that it required town officials to sign confidentiality agreements that precluded their even mentioning the facility to the press. Although Google is open enough now about having a data center located on this particular bend of the Columbia River, to this day Google Earth displays only overhead views of the site taken before construction commenced.
Google's if-we-tell-you-we'll-have-to-kill-you attitude toward its data centers isn't ironclad, however. For example, in 2009, Google hosted an energy-efficient data-center summit, where it revealed much about its operations. Days later, a narrated video tour of one of its early data centers, which the company refers to publicly only as "Data Center A," appeared on YouTube, which Google owns. This facility's more than 45 000 servers are mounted in 45 giant shipping containers, giving the interior of the cavernous building a strangely temporary look—as if Google wanted to be able to pack up and move all these servers to another location at a moment's notice.
This modular approach is not unique to Google, but it's not standard practice either. Google has also departed from data-center tradition in the way it handles power outages. Backup generators kick in when the grid fails, but they don't work instantly, requiring an uninterruptible power supply (UPS) to keep each server running for the first 10 seconds or more after the lights flicker off.
Most data centers use large centralized UPSs to carry them through that precarious interval, but those units are inherently wasteful. That's because the AC voltage feeding each one gets converted to DC and then back to AC, which is sent to the various power-distribution units and then to the individual servers. Those conversions even out voltage sags and spikes, but they squander about 10 percent of the electricity going to the computing equipment. To avoid these losses, Google instead attaches a small UPS to each server.
Another energy-conservation measure Google shows off in its YouTube data-center tour is a cooling technique known as water-side economization. That's important, because just a few years ago cooling accounted for 40 percent of a typical data center's electricity bill. Water-side economization cuts down on costs because it uses—you guessed it—water, which is dripped over heat exchangers outside the building. A closed cooling loop that passes through the server-packed containers inside brings warm water to these heat exchangers. As the water dripping over them evaporates, it carries away much of the heat. That tactic, combined with the right climate, keeps the servers cool enough most of the time, and conventional chillers can be put into the loop to assist as needed.
Google's data centers use electricity far more efficiently than was typical in the past. Data-center experts gauge efficiency using a statistic called power usage effectiveness (PUE), which is computed by dividing all the electrical power used in a facility by the power delivered to just the computers and related networking equipment. By the end of 2010, Google's data centers achieved an overall PUE of 1.13, impressively close to the ideal value of 1.0. This is a great improvement from five years ago, when a sampling of 22 data centers showed an average PUE of 2.0, meaning that for each watt actually used for computing and networking tasks, another watt was squandered on chillers, lights, fans, pumps, office equipment, and probably more than a few snack machines.
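For readers who like to see the arithmetic, here is a minimal sketch of the PUE calculation as defined above. The function names and the kilowatt figures in the usage note are illustrative assumptions, not measurements from any real facility.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_per_it_watt(pue_value):
    """Watts spent on cooling, lighting, and so on per watt of computing."""
    return pue_value - 1.0


# Hypothetical facility: 2000 kW drawn in total, 1000 kW reaching the servers.
legacy = pue(2000.0, 1000.0)
print(legacy, overhead_per_it_watt(legacy))
```

With these made-up inputs the PUE is 2.0, so one extra watt of overhead accompanies each watt of computing. Feeding in a PUE of 1.13 gives an overhead of roughly 0.13 watt per computing watt.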
Although other large data centers can boast efficiencies that rival those Google has reported, Google's PUE numbers will probably creep even lower as newer, more energy-efficient centers come on line. The company is now working on a data center in Finland that will cool servers with seawater pumped in from the Baltic, for example. But don't think that Google's engineers are the only ones who know how to build super-energy-efficient data centers. Facebook's engineers are now up to speed as well, and they've been eager to show off their results.
Tramping around the red-clay expanse where Facebook is constructing its newest data center, located in what was until recently a patch of woods east of the Blue Ridge Mountains, I'm struck by the immense scale of the project: the 34 000-square-meter (370 000-square-foot) structure looming in front of me stretches some 340 meters (about 1100 feet) end to end. Imagine a Walmart supercenter on steroids, and you still won't be thinking large enough. What's more, Facebook may one day place three of these behemoths here. Even more stunning than its size is the speed at which Facebook's general contractor, DPR/Fortis, has put all this concrete and steel in place, having broken ground at the site just four months earlier.
"I'm amazed at this one," says Tom Furlong, who directs site operations for Facebook's data centers. At our meeting in a construction trailer at the site in North Carolina's rural Rutherford County, Furlong tells me that in Facebook's early years it could get away with colocation (where one building houses servers for several companies) and then, as the company grew, with leasing entire data centers.
But the situation changed in 2008, when much of the world economy slowed to a crawl. At the time, most of Facebook's servers were in leased data centers in the San Francisco Bay area, where the company continues to occupy eight separate facilities tied together with high-speed data links. It also began leasing data-center space in northern Virginia to create a matching East Coast hub of operations.
But when he went looking for more space, Furlong says, he discovered that many data-center projects had been shelved—victims of the financial crisis. Ultimately, the search for space proved so frustrating that in early 2009 Facebook decided to build a data center of its own. By August, Furlong and his colleagues had settled on Prineville, Ore., just 150 kilometers from Google's facility in The Dalles. Before the Prineville center was even completed, the company announced plans to build a second one in North Carolina.
As if to prove Facebook CEO Mark Zuckerberg's contention that openness is a new social norm, Facebook's engineers have released many of the technical specs for these state-of-the-art data facilities, which they believe will have PUE values of 1.07 or less. They call their sharing the Open Compute Project. They're not exactly open sourcing their hardware for anyone else to duplicate, but the descriptions they offer are surprisingly detailed.
The Prineville center, which officially came on line in April, departs markedly from the containerized approach Google favors. "You can save some money with modularity, but you get restricted in a lot of different ways," says Furlong. So Facebook houses its servers in conventional racks placed directly on the data-center floor, where they are cooled by the flow of air blowing in from the side. But that air doesn't come from conventional air-conditioning equipment. Jay Park, Facebook's director of data-center design and construction, explains that they use "direct evaporative cooling." A fine mist cools the air, which in turn cools the servers. "It's a big honkin' misting system," Park says.
Direct evaporative cooling fits well with the overall philosophy that Facebook's data-center engineers have come to favor: simplicity. "Our data center in Prineville basically uses outside air," says Park. "We filter the air, pass it through the misting system, and then blow that into the data center. If the outside air is too cold, we'll recirculate some of it. If not, we'll just dump it out." Park boasts that the mechanical design is so straightforward that it doesn't even require ductwork, and his computational fluid dynamics calculations show that it still does the job. Facebook's North Carolina facility will be similarly configured, although it will include back-up air conditioners to supplement the misting system during hot spells.
Facebook's data centers, like Google's, dispense with centralized UPSs. Instead, one phase of the three-phase, 480-volt AC derived from the main utility switchboard is sent directly to the servers, which contain custom power supplies that use what amounts to a 277-volt AC feed. During a power outage, those supplies can also run off the 48-volt DC coming from specially engineered UPS cabinets installed next to the server racks.
That arrangement saves watts, and it also simplifies maintenance, because there are far fewer pieces of equipment to maintain. "In traditional data centers, with the UPSs higher up the food chain, you have a lot of additional breakers and connections to bypass," says Furlong. "Moving the UPS close to the server gives you the flexibility not to have all that hardwired extra stuff."
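The 277-volt figure mentioned above is not arbitrary. In a wye-connected three-phase system, the voltage from any one phase to neutral equals the line-to-line voltage divided by the square root of three. The short sketch below checks that relationship; it assumes a wye connection, which the article's numbers imply but do not state outright, and the function name is invented for illustration.

```python
import math


def line_to_neutral_volts(line_to_line_volts):
    """Single-phase (line-to-neutral) voltage in a wye-connected system."""
    return line_to_line_volts / math.sqrt(3)


# 480-volt three-phase service, as described for Facebook's data centers.
print(round(line_to_neutral_volts(480.0)))
```

Running this on a 480-volt supply yields a value that rounds to 277 volts, which is why a single phase can feed the server power supplies directly without an intermediate transformer stage.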
Giant data centers—even energy-efficient ones—are, of course, nothing without the proper servers. Facebook will be populating its Oregon and North Carolina locations with custom-designed servers, just as Google has long done.
Facebook's Amir Michael, manager of hardware design, explains that when the company decided to build its own facilities, "we had a clean slate," which allowed him and his colleagues to optimize the designs of their centers and servers in tandem for maximum energy efficiency. The result was a server that "looks very bare bones. I call it a 'vanity-free' design just because I don't like people to call it ugly," says Michael. "It has no front bezels. It has no paint. It has no logos or stickers on it. It really has only what is required."
Google also keeps server frills to a minimum. Like Facebook, it buys commodity-level computing hardware and just fixes the many pieces that break, instead of purchasing high-end systems that are less prone to failure but also much more expensive. Economics, if nothing else, drove engineers at both companies to similar conclusions here. Fit and finish might count if you're buying one server or even a hundred, but not when you're shopping for tens of thousands at a time. And striving for high reliability is a little pointless at this scale, where failure is not only an option, it's a daily fact of life.
Facebook's Michael explains that he helped design three basic types of servers for running the Facebook application. The top layer of hardware, connected most directly with Facebook's many users, consists of outward-facing Web servers. They don't require much disk space—just enough for the operating system (Linux), the basic Web-server software (which until recently was Apache), the code needed to assemble Facebook pages (written in PHP, a scripting language), some log files, and a few other bits and pieces. Those machines are connected to a deeper layer of servers stuffed with hard disks and flash-based solid-state drives, which provide persistent storage for the giant MySQL databases that hold Facebook users' photos, videos, comments, and friend lists, among other things. In between are RAM-heavy servers that run a memcached system to provide fast access to the most frequently used content.
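To make the role of the middle tier concrete, here is a minimal sketch of the common "cache-aside" pattern, in which a Web server checks a fast in-memory cache before falling back to the database. This is an assumption about how such a tier is typically used, not a description of Facebook's actual code: plain Python dictionaries stand in for memcached and MySQL, and the class and key names are hypothetical.

```python
class CacheAsideStore:
    """Toy three-tier read path: caller -> cache -> database."""

    def __init__(self, database):
        self.database = database  # stand-in for the persistent MySQL tier
        self.cache = {}           # stand-in for the RAM-heavy memcached tier
        self.db_reads = 0         # counts how often we fall through to the database

    def get(self, key):
        # Fast path: serve frequently used content straight from memory.
        if key in self.cache:
            return self.cache[key]
        # Slow path: read from the database, then remember the answer.
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write to the database and drop any stale cached copy.
        self.database[key] = value
        self.cache.pop(key, None)


store = CacheAsideStore({"user:1:name": "Alice"})
store.get("user:1:name")   # first read goes to the database
store.get("user:1:name")   # second read is served from the cache
print(store.db_reads)
```

The design choice worth noticing is that the cache is filled lazily on reads and invalidated on writes, so only content that people actually request occupies memory.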
Alpha geeks will recognize that these pieces of software—Linux, Apache, PHP, MySQL, memcached—all hail from the open-source community. Facebook's programmers have modified these and other open-source packages to suit their needs, but at the most basic level, they are doing exactly what countless Web developers have done: building their site on an open-source foundation.
Not so at Google. Programmers there have written most of their company's impressive software from scratch—with the exception of the Linux running on its servers. Most prominent are the Google File System (or GFS, a large-scale distributed file system), Bigtable (a low-overhead database), and MapReduce (which provides a mechanism for carrying out various kinds of computations in a massively parallel fashion). What's more, Google's programmers have rewritten the company's main search code more than once.
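The MapReduce idea can be illustrated on a single machine. The sketch below is not Google's implementation, which spreads the work across many servers; it only shows the three logical steps (map, group by key, reduce) using the classic word-count example. All function names are invented for illustration.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document, yielding (key, value) pairs."""
    for document in documents:
        yield from mapper(document)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(document):
    # Emit (word, 1) for every word in the document.
    for word in document.lower().split():
        yield word, 1


def count_reducer(word, counts):
    # Sum the ones emitted for this word.
    return sum(counts)


def word_count(documents):
    pairs = map_phase(documents, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


print(word_count(["ugly servers", "ugly racks"]))
```

Because each mapper call touches only one document and each reducer call touches only one key, both phases could in principle be farmed out to separate machines, which is what makes the approach attractive at data-center scale.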
Speaking two years ago at the Second ACM International Conference on Web Search and Data Mining, Jeff Dean, a Google Fellow working in the company's system infrastructure group, said that over the years his company has made seven significant revisions to the way it implements Web search. However, outsiders don't realize that, because, as Dean explained, "you can replace the entire back end without anyone really noticing."
How are we to interpret the difference between Google's and Facebook's engineering cultures with respect to the use of open-source code? Part of the answer may just be that Google, having started earlier, had no choice but to develop its own software, because open-source alternatives weren't yet available. But Steve Lacy, who worked as a software engineer for Google from 2005 to 2010, thinks otherwise. In a recent blog post, he argues that Google just suffers from a bad case of not-invented-here syndrome. Many open-source packages "put Google infrastructure to shame when it comes to ease of use and product focus," writes Lacy. "[Nevertheless, Google] engineers are discouraged from using these systems, to the point where they're chastised for even thinking of using anything other than Bigtable/Spanner and GFS/Colossus for their products."
Might Google's or Facebook's infrastructure yet crack under their ever-increasing loads? Facebook's regular user base has mushroomed to more than half a billion people, and it continues to add more than 20 million users a month. And Google must devote vast computing resources to keep up with the 34 000 searches it performs each second, while running ad auctions, translating languages, handling Gmail traffic, hosting YouTube videos, and more. Can all this just go on and on without end?
It seems it can. While the costs are enormous, these companies appear to be handling the computing burden with relative ease. But maybe that shouldn't be surprising. After all, if they don't have adequate horsepower, they can always delay the introduction of whatever resource-intensive service they're working on—and they both roll out such features regularly enough.
Take Google Instant, an instant-search function that the company introduced last September. "The point was to increase the delight factor," says Ben Gomes, who headed the Google Instant team. Instant looks at the first letters you key into a search query and offers a page of results based on what it anticipates you intend to type. So for even a simple query, multiple searches must now go on in parallel, and further calculations must be carried out to choose which results to show. Would Google have made such a change if its engineers had any doubts about the ability of their systems to take the punishment?
Similarly, Facebook can certainly control how heavily its users tax its system. As of last year, for instance, Facebook users had uploaded 50 billion pictures to the site. And yet, despite the enormous amount of bandwidth and storage space eaten up by all those photos, Facebook has periodically boosted the resolution of the images users can save. Would it have done that if its operations engineers felt their computing infrastructure was in danger of collapse?
My point is that neither Google nor Facebook is likely to falter in scaling up their systems to match demand. Sure, there will be glitches and slowdowns from time to time. But it seems unlikely that either company will suffer from a long-term lack of computing oomph as they continue to shape the way we run our online lives. Just how quickly they'll have to build new data centers, and what new kinds of energy-saving technology those centers will contain, is anyone's guess. But one thing's for sure: Their servers will always be ugly.