The global race to build more powerful supercomputers is focused on the next big milestone: a supercomputer capable of performing 1 million trillion floating-point operations per second (1 exaflops). Such a system will require a major overhaul of how these machines compute, how they move data, and how they're programmed. Reaching that goal could take eight years. But the seeds of future success are being designed into two machines that could arrive in just two years.
China and Japan each seem focused on building an exascale supercomputer by 2020. But the United States probably won't build its first practical exascale supercomputer until 2023 at the earliest, experts say. To hit that target, engineers will need to do three things. First, they'll need new computer architectures capable of combining tens of thousands of CPUs and graphics-processor-based accelerators. Second, they'll need to rein in the growing energy cost of moving data from a supercomputer's memory to its processors. Finally, software developers will have to learn how to build programs that can make use of the new architectures.
“To some degree it depends on how much money a country is willing to spend,” says Steve Scott, senior vice president and chief technology officer at Cray. “You could build an exaflop computer tomorrow, but it’d be a crazy thing to do because of the cost and energy required to run it.”
Simply scaling up today's supercomputer architecture to build an exascale supercomputer would lead to a machine that requires the equivalent of a gigawatt-scale nuclear power plant, wrote Peter Kogge, a computer scientist and engineer at the University of Notre Dame, in IEEE Spectrum in January 2011. Instead, the U.S. government hopes to achieve practical exascale supercomputing in the 2020s at a cost of about US $200 million and 20 to 30 megawatts of power, says Horst Simon, deputy director at Lawrence Berkeley National Laboratory, in California. (Each megawatt can cost a U.S. national lab about $1 million annually.)
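To put those power figures in dollar terms, here is the arithmetic using the rule of thumb quoted above (about $1 million per megawatt per year). It assumes, for illustration only, that the cost scales linearly with power draw; the gigawatt line is an extrapolation of that rule, not a figure from Kogge or Simon.

$$
C_{\text{annual}} \approx P\,[\text{MW}] \times \$1\ \text{million per MW-year}
$$

$$
P = 20\text{ to }30\ \text{MW} \;\Rightarrow\; C_{\text{annual}} \approx \$20\text{ to }\$30\ \text{million}, \qquad P = 1000\ \text{MW}\ (1\ \text{GW}) \;\Rightarrow\; C_{\text{annual}} \approx \$1\ \text{billion}
$$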
The U.S. Department of Energy recently announced that it will invest $325 million in a pair of supercomputers, each capable of performing one-tenth of an exaflops or more, being developed by IBM, Mellanox, Nvidia Corp., and other companies for a 2017 debut. The planned supercomputers, named Summit and Sierra, will rely on a new computer architecture that stacks memory near the Nvidia GPU accelerators and IBM CPUs. By minimizing the energy cost of moving data between memory and processors, that architecture is a big step toward exaflops supercomputers, experts say.
Practical exascale computing will need additional development of stacked memory and faster, more energy-efficient interconnects to boost the performance of densely packed supercomputer chips, Simon explains. But he anticipates the need for other technological tricks too. One such technology—silicon photonics—would use lasers to provide low-power data links within the system.
Power and cost aren’t the only problems preventing practical exascale systems. The risk of hardware failures grows as new supercomputers pack in a greater number of components, says Bronis de Supinski, chief technical officer for Livermore Computing at Lawrence Livermore National Laboratory, in California. His lab’s IBM Blue Gene/Q supercomputer, named Sequoia, currently has a mean time between failures of 3.5 to 7 days. Such a window could shrink to just 30 minutes for an exascale system.
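Why do more components mean a shorter mean time between failures? A common back-of-envelope model, which is an assumption here and not something the article states, treats components as failing independently at constant rates, so their failure rates add and the system MTBF is the component MTBF divided by the component count. The minimal sketch below uses that model to compute how much the aggregate failure rate would have to grow for Sequoia's 3.5-to-7-day MTBF to shrink to 30 minutes. The function names are hypothetical.

```python
# Back-of-envelope MTBF scaling. ASSUMPTION (not from the article): components
# fail independently at constant rates, so their failure rates simply add.

MINUTES_PER_DAY = 24 * 60


def system_mtbf(component_mtbf: float, n_components: int) -> float:
    """MTBF of n identical, independent components whose failure rates add."""
    return component_mtbf / n_components


def rate_increase(current_mtbf: float, target_mtbf: float) -> float:
    """Factor by which the aggregate failure rate grows if MTBF drops
    from current_mtbf to target_mtbf (both in the same units)."""
    return current_mtbf / target_mtbf


if __name__ == "__main__":
    exascale_mtbf = 30.0  # minutes, the projection quoted for an exascale system
    for days in (3.5, 7.0):  # Sequoia's reported MTBF range
        factor = rate_increase(days * MINUTES_PER_DAY, exascale_mtbf)
        print(f"{days} days -> 30 minutes: failure rate up {factor:.0f}x")
```

Under this model, the drop from 3.5 to 7 days down to 30 minutes corresponds to an aggregate failure rate roughly 168 to 336 times higher, for example from that many times more hardware of the same per-part reliability.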
A 30-minute window is hardly enough time for researchers to run complex simulations or other applications. But software capable of automatically restarting programs could help supercomputing systems recover from some hardware errors. "This is an instance in which the physical realities of hardware...end up creating challenges which we have to handle in software," de Supinski says.
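The article doesn't describe how any lab's restart software works, so the following is only a generic sketch of application-level checkpoint/restart: a job periodically saves its state, and after a failure it resumes from the last saved state instead of starting over. Everything here is hypothetical (the function names, the JSON checkpoint file, the toy summation "job"). The `young_interval` helper applies Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint cost × MTBF), a standard rule of thumb swapped in for illustration rather than anything the article cites.

```python
import json
import math
import os
import tempfile
from typing import Optional


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal checkpoint interval, sqrt(2 * C * M).

    C is the time to write one checkpoint and M the mean time between
    failures, in the same units; the approximation assumes C << M.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then atomically rename it over `path`,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path: str, n_steps: int, every: int, fail_at: Optional[int] = None) -> dict:
    """Hypothetical toy job that sums step indices, checkpointing every
    `every` steps. If `fail_at` is given, raise at that step to mimic a
    hardware fault; rerunning the job resumes from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError(f"simulated node failure at step {fail_at}")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
    try:
        run(ckpt, n_steps=100, every=10, fail_at=57)
    except RuntimeError as err:
        print("crashed:", err)
    # The restart picks up from the step-50 checkpoint and redoes steps 50-56.
    print(run(ckpt, n_steps=100, every=10))  # {'step': 100, 'total': 4950}
    # e.g. a 1-minute checkpoint on a machine with a 30-minute MTBF:
    print(round(young_interval(1.0, 30.0), 1))  # 7.7
```

The trade-off the interval formula captures: checkpointing too often wastes time writing state, while checkpointing too rarely means more recomputation after each failure. As MTBF shrinks, the best interval shrinks with its square root.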
Experts also point to the challenge of writing software applications that work for tens or hundreds of thousands of CPUs running in parallel. The programming becomes even more complex for the newer supercomputing architecture that includes GPU accelerators. That’s why Nvidia, based in Santa Clara, Calif., and its partner companies working on the planned Summit and Sierra machines have already reached out to thousands of software developers at universities around the world to begin teaching them about its accelerators.
Beyond Sierra and Summit, the U.S. Department of Energy has invested an additional $100 million in paving the way toward exascale supercomputing. But such investments won’t benefit just the few big U.S. government labs that can afford such machines. The new computer architectures needed to make an exascale supercomputer could also make supercomputing more widely accessible, says Sumit Gupta, general manager for the Tesla accelerated computing business at Nvidia.
“One thing I’m always intrigued by is, once we have an exascale machine, how small will a petaflop machine be?” Gupta says. “Will it fit in a backpack or under my desk? What is the research that the average graduate student could do that they can’t do today? I always find that aspect much more intriguing.”