NSA's Data Center Electrical Problems Aren't That Shocking

Expert says electrical failures in mega data centers aren't unheard of


The new NSA data center in Bluffdale, Utah.
Photo: George Frey/Getty Images

Last week, the Wall Street Journal reported that arc-fault failures—electrical problems that exceed the protective capabilities of circuit breakers and cause metal to melt and go flying—are delaying completion of the NSA’s controversial new Utah data-storage center. The article reported that 10 such meltdowns over the past 13 months had led to disputes about the adequacy of the electric control systems, and suggested that designers and builders of the new data center may have cut corners.

A report from the design and construction team on exactly what is going on at the NSA Utah data center is pending; meanwhile, the Army Corps of Engineers has also been investigating, and has indicated that the causes of most of the failures remain a mystery.

So we’ll have to wait a bit for more details. But the NSA’s problems raise questions that extend beyond the Utah center or the NSA itself. We’re putting more and more of our data in the cloud daily, driving the construction of bigger and bigger data centers. Should we expect more data center fires in the future, or is the NSA situation unique?

For some answers, I turned to Dennis Symanski, a senior project manager at the Electric Power Research Institute (EPRI), whose research focuses on energy use in data centers.

First, I wanted to make sure I really understood the arc-fault problem. Symanski explained:

"A circuit breaker tries to open up, but the amperage is higher than it can handle, so when the contacts open it doesn’t break the circuit. Instead, you have an arc that is moving around between the contacts; it has huge magnetic forces to it, and it can’t be easily extinguished. You end up with molten metal flying around which is extremely dangerous."

Then I needed to understand just how complex a data center’s power design can be. Symanski walked me through a typical setup:

"You’ve got normal utility power coming into the back of the building, say, at whatever the distribution voltage is—it could be 23 kV, it could be 12 kV. That power hopefully is coming from two independent substations, so if there is a problem with one substation you can continue to feed power from the other substation. That power goes into a transformer that steps the power down to 3-phase 480 volts. Normally you also have a couple of diesel generators for backup power, in case all the utility goes away. That all ties in to metal clad switch gear; this is where the circuit breakers are located; it is metal clad so that if there is an arc fault somewhere it is contained.

The power from the substations and the diesel generators all goes into an uninterruptible power supply that converts the AC to DC and stores reserve power in batteries and then converts it back to AC again. The AC power is stepped down, typically to 208 volts in the U.S., then it goes into the power supplies in back of servers and storage arrays. These power supplies convert it to 380 volt DC. From there, it is converted within the server or storage array to all the different voltages you need, like 12 volts DC for fan, 5 volts for disk drives, 3.3 volts for memory."
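To make that chain easier to follow, the short Python sketch below (mine, not from the article) lists those stages in order, following Symanski’s description; treat it as an illustrative summary rather than an actual design.

```python
# A walk-through of the power chain Symanski describes, from the utility
# feed down to the voltages inside a server. Stage names and voltages
# follow his description; this is an illustrative summary, not a design.

POWER_CHAIN = [
    ("utility feed (from two independent substations)", "12-23 kV AC"),
    ("step-down transformer",                           "480 V, 3-phase AC"),
    ("metal-clad switchgear (breakers; diesel backup ties in here)", "480 V AC"),
    ("uninterruptible power supply (AC to DC to batteries, back to AC)", "480 V AC"),
    ("distribution step-down",                          "208 V AC"),
    ("server/storage power supply",                     "380 V DC internal bus"),
    ("on-board DC-DC converters",                       "12 V (fans), 5 V (disks), 3.3 V (memory)"),
]

for stage, output in POWER_CHAIN:
    print(f"{stage:66s} -> {output}")
```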

An arc fault typically occurs inside the metal-clad switchgear, where the power levels are highest. But it can happen downstream, Symanski says, if equipment rated for too low a current is installed, though that is less likely.

Why haven’t we heard much about arc faults in data centers before? The typical data center at a company, government office, or educational institution uses older, proven designs and handles far less power, says Symanski.

But companies like Google, Amazon, and, lately, the NSA, are entering uncharted waters in terms of data center design. Symanski explains:

"These new data centers are trying to push the envelope as far as efficiency and space. The electrical load of the computers is going up; some of these have hundreds of thousands of servers, some of these servers have six power cords going into them or some have massive supercomputers drawing 10 megawatts each. They are using huge amounts of power."

The NSA data center, for example, continuously uses 65 megawatts, according to the Wall Street Journal. A more typical large data center today uses about 20 megawatts. The data centers of the dot-com era used 1 or 2 megawatts.
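For a sense of scale, converting those continuous loads into energy over a year is simple arithmetic; the quick sketch below does just that, using the figures quoted above.

```python
# Back-of-the-envelope arithmetic: continuous load in megawatts
# converted to energy per year, using the figures quoted above.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

for label, megawatts in [("NSA Utah center", 65),
                         ("typical large data center today", 20),
                         ("dot-com-era data center", 2)]:
    gwh_per_year = megawatts * HOURS_PER_YEAR / 1000
    print(f"{label:32s} {megawatts:3d} MW  ~ {gwh_per_year:,.0f} GWh/year")
```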

When you’re pushing the envelope, startup problems, even as many as the NSA has, aren’t unheard of, Symanski says.

"In the startup phase, when you start energizing things, you typically find problems, unless this is the 14th identical facility you’ve designed and built using the exact equipment from the same manufacturer. Hopefully you can fix them, and it’s a one-time fix; sometimes you have recurring problems.  I can’t tell you whether any number is higher than expected.

You design these things with a little bit of margin involved in everything, but sometimes your calculations are a little off. And you are more apt to be a little off if you are designing something that is truly unique and bigger and more dense than ever before. That’s what makes engineering tough."

Symanski would like data center designers to think about using less power, which would both benefit the environment and improve reliability.

"All of these new data centers are doing a quite good job with cooling efficiency. But they need to do a better job of having electronic equipment—the servers, the storage arrays, the networking gear—be more efficient as well. To do so, they can take some of the already known technology from battery operated smart phones and smart tablets and start incorporating that into data centers."

Another way to improve data center efficiency and reliability, Symanski suggests, is to run the entire facility on a DC network.

"With a 380-volt DC setup you have less equipment, you don’t need some of the conversion steps that require additional electronic devices that do fail, and you’ll have less losses, so you are running a more efficient and reliable system."
