The Beating Heart of the World’s First Exascale Supercomputer

These chips power Frontier past 1,100,000,000,000,000,000 operations per second


An image of Frontier, the world’s first exascale supercomputer, at Oak Ridge National Laboratory in Tennessee, U.S.
Carlos Jones/ORNL/U.S. Department of Energy

The world’s new fastest supercomputer, Frontier at Oak Ridge National Laboratory in Tennessee, is so powerful that it runs faster than the next seven best supercomputers combined and more than twice as fast as the No. 2 machine. Frontier is not only the first machine to break the exascale barrier, a threshold of a billion billion calculations per second, but is also ranked No. 1 as the world’s most energy-efficient supercomputer. Now the companies that helped build Frontier, Advanced Micro Devices (AMD) and Hewlett Packard Enterprise (HPE), reveal the electronic tricks that make the supercomputer tick.

Frontier consists of 74 HPE Cray EX supercomputing cabinets, each weighing more than 3,600 kilograms, which altogether hold more than 9,400 computing nodes. Each node contains one optimized third-generation AMD EPYC 64-core, 2-gigahertz “Trento” processor for general tasks and four AMD Instinct MI250X accelerators for highly parallel supercomputing and AI operations, as well as 4 terabytes of flash memory to help quickly feed the GPUs data. In total, Frontier contains 9,408 CPUs, 37,632 GPUs, and 8,730,112 cores, linked together by 145 kilometers of networking cables. The lab says its world-leading supercomputer consumes about 21 megawatts.
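Those totals follow directly from the per-node configuration. As a quick back-of-the-envelope check, here is a minimal C++ sketch using only the figures quoted above:

```cpp
// Sanity-checking Frontier's published totals from its per-node
// configuration: 9,408 nodes, each with one 64-core CPU and four GPUs.
#include <cstdio>

int main() {
    const long nodes       = 9408;
    const long gpusPerNode = 4;
    const long cpuCores    = 64;

    printf("CPUs:      %ld\n", nodes);                 // 9,408
    printf("GPUs:      %ld\n", nodes * gpusPerNode);   // 37,632
    printf("CPU cores: %ld\n", nodes * cpuCores);      // 602,112
    // The rest of the 8,730,112-core total comes from the GPUs' compute units.
    return 0;
}
```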

“Everyone up and down the line went after efficiency.”
—Brad McCredie, AMD

In May, at the International Supercomputing Conference 2022 in Hamburg, Frontier debuted with an overall performance of 1.1 exaflops, or 1.1 quintillion floating-point operations per second, launching it to the head of the Top500 list of the world’s most powerful supercomputers. It may grow even more powerful: its theoretical peak performance is 2 exaflops.

In addition, Frontier is ranked first on the latest Green500 list, which measures supercomputing energy efficiency, a point that may not be incidental to its standing as the world’s fastest. Whereas the previous top Green500 machine, MN-3 in Japan, delivered 39.38 gigaflops per watt, Frontier’s test and development system achieves 62.68 gigaflops per watt.
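The metric itself is simple arithmetic: sustained floating-point operations per second divided by power draw. A minimal sketch using the full-system figures quoted above (the Green500 score comes from the smaller test and development system, which is why it is higher):

```cpp
// Gigaflops per watt from Frontier's full-system figures:
// 1.1 exaflops sustained at roughly 21 megawatts.
#include <cstdio>

int main() {
    const double flops = 1.1e18;  // 1.1 exaflops
    const double watts = 21.0e6;  // about 21 megawatts
    printf("%.1f gigaflops per watt\n", flops / watts / 1e9);  // ~52.4
    return 0;
}
```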

Moreover, Frontier won the top spot in a newer category, mixed-precision computing, which rates performance in the reduced-precision computing formats commonly used for artificial intelligence. On the latest High Performance LINPACK for Accelerator Introspection (HPL-AI) benchmark, Frontier’s performance reached about 6.86 exaflops.
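The trick behind such benchmarks is to do the heavy arithmetic in low precision and then recover full accuracy with cheap refinement steps. Here is a purely illustrative C++ sketch of that idea for a scalar equation; real codes such as HPL-AI apply it to large matrix factorizations:

```cpp
// Mixed-precision iterative refinement in miniature: solve a*x = b with a
// low-precision "heavy" step, then refine in double precision. Scalar toy
// example; production solvers do this with matrix factorizations.
#include <cstdio>

int main() {
    const double a = 3.0, b = 1.0;

    // Heavy solve in single precision (stand-in for a low-precision LU).
    double x = static_cast<float>(b) / static_cast<float>(a);

    // Cheap refinement steps in double precision.
    for (int i = 0; i < 3; ++i) {
        double r = b - a * x;  // residual of the current answer
        x += r / a;            // correct the answer using the residual
    }
    printf("refined x = %.17f\ntrue x    = %.17f\n", x, b / a);
    return 0;
}
```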

A key aspect of Frontier’s success is how its CPUs and GPUs are linked within each node via AMD’s Infinity Fabric interconnect architecture. This boosts coherency between the CPU and GPUs—that is, it gives them all the same view of shared data.

“Coherency is very important to getting you to scale performance,” says Brad McCredie, corporate vice president of data center GPU and accelerated processing at AMD in Austin. “It helps you make sure that you can run the right workloads on the right processors. It makes it very easy for CPUs to do small pieces of work and GPUs to do big pieces of work in parallel.”
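In programming terms, coherency means the CPU and GPUs can work on one allocation without explicit copies back and forth. The sketch below, written against AMD’s HIP runtime, shows that style of division of labor; the kernel, sizes, and values are illustrative, not Frontier’s actual code:

```cpp
// A coherent, shared CPU/GPU view of memory: one allocation, no explicit
// copies. The CPU does a small serial piece of work, the GPU a big
// parallel one. Illustrative only; built on AMD's HIP runtime.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double* data, double factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // GPU: the big parallel piece
}

int main() {
    const size_t n = 1 << 20;
    double* data = nullptr;
    // One allocation visible to both the CPU and the GPU.
    hipMallocManaged(reinterpret_cast<void**>(&data), n * sizeof(double));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0;  // CPU: small serial work

    scale<<<(n + 255) / 256, 256>>>(data, 2.0, n); // GPU: bulk parallel work
    hipDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // CPU reads the result directly
    hipFree(data);
    return 0;
}
```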

During Frontier’s development, AMD noted the biggest challenge it faced was performance per watt. “There was a lot of documentation that it would take hundreds of thousands of GPUs and 150 to 500 MW to get to an exaflop, and we wanted to do it with tens of thousands of GPUs and 20 MW,” McCredie says. “So everyone up and down the line went after efficiency.”

For example, Frontier’s GPUs each have 128 gigabytes of high-bandwidth memory soldered onto them. This helps them overcome a critical bottleneck to performance—the shuffling of data between memory and processing.

Moreover, Frontier’s GPUs are each built on the advanced 6-nanometer node from TSMC (Taiwan Semiconductor Manufacturing Co.). As a result, “they can execute double-precision floating-point operations as fast as single-precision floating-point operations, which was a big innovation,” McCredie says.


These seemingly inconsequential developments in fact helped Frontier rely on tens of thousands of GPUs rather than hundreds of thousands, “shifting the burden away from the programmer to the hardware when it comes to managing all that parallelism,” McCredie says. “That makes the system much more programmable.”

Two AMD nodes fit on a “compute blade,” and 64 such blades are loaded into each cabinet. The compute blades are linked together by HPE Slingshot interconnects, each with a custom-designed 64-port switch that provides 12.8 terabits per second of network bandwidth. Groups of blades are linked via a so-called dragonfly topology, in which hundreds of cabinets holding hundreds of thousands of nodes can all communicate with at most three hops between any two nodes.
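The three-hop bound is a property of the dragonfly arrangement itself: switches within a group are fully connected, and every pair of groups shares at least one global link. A minimal sketch of worst-case minimal routing under those assumptions (the group and switch identifiers are hypothetical):

```cpp
// Why a dragonfly network needs at most three switch-to-switch hops:
// one local hop to the switch holding the global link, one global hop
// to the destination group, one local hop to the destination switch.
#include <cstdio>

struct SwitchId { int group; int sw; };  // group index, switch within group

// Worst-case minimal-route hop count, assuming all-to-all local links
// inside each group and a global link between every pair of groups.
int maxMinimalHops(SwitchId src, SwitchId dst) {
    if (src.group == dst.group)
        return (src.sw == dst.sw) ? 0 : 1;  // same switch, or one local hop
    return 1   // local hop to the switch with the right global link
         + 1   // global hop between the two groups
         + 1;  // local hop to the destination switch
}

int main() {
    printf("same group:  %d hops\n", maxMinimalHops({0, 1}, {0, 5}));  // 1
    printf("cross group: %d hops\n", maxMinimalHops({0, 1}, {7, 3}));  // 3
    return 0;
}
```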

“Slingshot deployments are highly optimized to use the most energy-efficient cabling—direct attach copper and active optical cables—fitted to the distances required,” says Mike Woodacre, vice president and chief technical officer of HPE’s HPC and AI team. Eliminating less-efficient general-purpose components, he adds, “significantly reduces the energy the fabric consumes.”

The blades in the cabinets are chilled using liquid cooling. According to Gerald Kleyn, vice president of HPC and AI systems at HPE, the supercomputer can achieve up to five times the density of a traditional, air-cooled architecture. The result is a compact system that in turn dramatically reduces cabling requirements and operational expenses.

“Breaking the exaflop barrier was important, but doing so while achieving No. 1 on the Green500 list is remarkable,” says Kleyn. Moreover, accomplishing this in the midst of a pandemic and global supply-chain problems, he says, “took a herculean team effort between Oak Ridge National Laboratory, HPE, and AMD.”

Despite challenges including pandemic-related supply-chain issues, delivery of the Frontier supercomputer system took place between September and November 2021. Carlos Jones/ORNL/U.S. Department of Energy

The next steps for Frontier include continued testing and validation of the system. The lab says it remains on track for final acceptance and early science access later in 2022, with full science operations planned for the beginning of 2023.

Projects already planned for Frontier include research into cancer, drug discovery, nuclear fusion, exotic materials, superefficient engines, and stellar explosions. The aim is to cut the time such work requires from weeks to hours and from hours to seconds.

“Frontier enables scientists to do more science, which means getting closer to more efficient, cleaner-burning energy and more quickly finding even more effective vaccines for viruses,” McCredie says. “We started this whole adventure with Frontier to be the first to an exaflop, but seeing people at Oak Ridge working to solve problems in climate, energy, the pandemic, the top challenges facing humanity—we’ve gone from wanting to build a powerful computer to building something that will help everyone.”

This article appears in the August 2022 print issue as “Beneath the Hood of the First Exascale Computer.”
