Scientific supercomputing is not immune to the wave of machine learning that's swept the tech world. Those using supercomputers to uncover the structure of the universe, discover new molecules, and predict the global climate are increasingly using neural networks to do so. And as is long-standing tradition in the field of high-performance computing, it's all going to be measured down to the last floating-point operation.
Twice a year, Top500.org publishes a ranking of raw computing power using a value called Rmax, derived from benchmark software called Linpack. By that measure, it's been a bit of a dull year. The ranking of the top nine systems is unchanged from June, with Japan's Supercomputer Fugaku on top at 442,010 trillion floating-point operations per second. That leaves the Fujitsu-built system still short of the long-sought goal of exascale computing: one-million trillion 64-bit floating-point operations per second, or one exaflops.
But by another measure, one more related to AI, Fugaku and its competitor, the Summit supercomputer at Oak Ridge National Laboratory, have already passed the exascale mark. That benchmark, called HPL-AI, measures a system's performance using the lower-precision numbers (16 bits or fewer) common to neural network computing. Using that yardstick, Fugaku hits 2 exaflops (no change from June 2021) and Summit reaches 1.4 exaflops (a 23 percent increase).
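To make the unit bookkeeping concrete, here is a minimal Python sketch that converts Fugaku's Linpack figure into exaflops and compares it with its HPL-AI score. The variable and function names are illustrative assumptions; the numbers are the ones quoted above.

```python
# Unit bookkeeping for the figures quoted in the article.
# Names here are illustrative, not part of any benchmark tooling.
TERA = 10**12   # one trillion
EXA = 10**18    # one-million trillion

fugaku_rmax_tflops = 442_010     # Linpack Rmax, trillions of 64-bit flop/s
fugaku_hpl_ai_exaflops = 2.0     # HPL-AI score (16 bits or fewer)


def tflops_to_exaflops(tflops: float) -> float:
    """Convert trillions of operations per second to exaflops."""
    return tflops * TERA / EXA


fugaku_rmax_exaflops = tflops_to_exaflops(fugaku_rmax_tflops)

# How much larger the low-precision score is than the 64-bit one.
hpl_ai_to_rmax_ratio = fugaku_hpl_ai_exaflops / fugaku_rmax_exaflops

print(f"Fugaku Rmax: {fugaku_rmax_exaflops:.3f} exaflops")
print(f"HPL-AI / Rmax: {hpl_ai_to_rmax_ratio:.1f}x")
```

The two scores use different numeric precisions, so the ratio is only a rough indication of how much headroom lower-precision arithmetic provides, not a like-for-like speedup.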
But HPL-AI isn't really how AI is done in supercomputers today. Enter MLCommons, the industry organization that's been setting realistic tests for AI systems of all sizes. It released results from version 1.0 of its high-performance computing benchmarks, called MLPerf HPC, this week.
The suite of benchmarks measures the time it takes to train real scientific machine learning models to agreed-on quality targets. Compared to MLPerf HPC version 0.7, basically a warmup round from last year, the best results in version 1.0 showed a 4- to 7-fold improvement. Eight supercomputing centers took part, producing 30 benchmark results.
As in MLPerf's other benchmarking efforts, there were two divisions: "Closed" submissions all used the same neural network model to ensure a more apples-to-apples comparison; "open" submissions were allowed to modify their models.
The three neural networks trialed were:
- CosmoFlow uses the distribution of matter in telescope images to predict things about dark energy and other mysteries of the universe.
- DeepCAM tests the detection of cyclones and other extreme weather in climate data.
- OpenCatalyst, the newest benchmark, predicts the quantum mechanical properties of catalyst systems to discover and evaluate new catalyst materials for energy storage.
In the closed division, there were two ways of testing these networks. Strong scaling allowed participants to use as much of the supercomputer's resources as they needed to achieve the fastest neural network training time. Because it's not really practical to use an entire supercomputer's worth of CPUs, accelerator chips, and bandwidth resources on a single neural network, strong scaling shows what researchers think the optimal distribution of resources can do. Weak scaling, in contrast, divides the entire supercomputer among hundreds of identical instances of the same neural network to figure out what the system's AI abilities are in total.
Here's a selection of results:
Argonne National Laboratory used its Theta supercomputer to measure strong scaling for DeepCAM and OpenCatalyst. Using 32 CPUs and 129 Nvidia GPUs, Argonne researchers trained DeepCAM in 32.19 minutes and OpenCatalyst in 256.7 minutes. Argonne says it plans to use the results to develop better AI algorithms for two upcoming systems, Polaris and Aurora.
The Swiss National Supercomputing Centre used Piz Daint to train OpenCatalyst and DeepCAM. In the strong scaling category, Piz Daint trained OpenCatalyst in 753.11 minutes using 256 CPUs and 256 GPUs. It finished DeepCAM in 21.88 minutes using 1024 of each. The center will use the results to inform algorithms for its upcoming Alps supercomputer.
Fujitsu and RIKEN used 512 of Fugaku's custom-made processors to train CosmoFlow in 114 minutes. They then used half of the complete system (82,944 processors) to perform the weak scaling benchmark on the same neural network. That meant training 637 instances of CosmoFlow, which the system managed to do at an average of 1.29 models per minute for a total of 495.66 minutes (a little over 8 hours).
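The weak scaling figure is simply the number of model instances divided by the wall-clock time. The short Python sketch below reproduces that arithmetic from the Fugaku numbers quoted above; the variable names are illustrative assumptions.

```python
# Weak scaling arithmetic for Fugaku's CosmoFlow run, using the
# figures quoted in the article. Variable names are illustrative.
instances_trained = 637     # CosmoFlow models trained
total_minutes = 495.66      # wall-clock time for the whole run

# Throughput: models completed per minute across the machine.
models_per_minute = instances_trained / total_minutes

# The same duration expressed in hours.
total_hours = total_minutes / 60

print(f"{models_per_minute:.2f} models per minute")
print(f"{total_hours:.2f} hours")
```

Note that this throughput describes the whole machine working on many models at once; it says nothing about how long any single instance took to train.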
Helmholtz AI, a joint effort of Germany's largest research centers, tested both the JUWELS and HoreKa supercomputers. HoreKa's best effort was to chug through DeepCAM in 4.36 minutes using 256 CPUs and 512 GPUs. JUWELS did it in as little as 2.56 minutes using 1024 CPUs and 2048 GPUs. For CosmoFlow, its best effort was 16.73 minutes using 512 CPUs and 1024 GPUs. In the weak scaling benchmark, JUWELS used 1536 CPUs and 3072 GPUs to plow through DeepCAM at a rate of 0.76 models per minute.
Lawrence Berkeley National Laboratory used the Perlmutter supercomputer to conquer CosmoFlow in 8.5 minutes (256 CPUs and 1024 GPUs), DeepCAM in 2.51 minutes (512 CPUs and 2048 GPUs), and OpenCatalyst in 111.86 minutes (128 CPUs and 512 GPUs). It used 1280 CPUs and 5120 GPUs for the weak scaling effort, yielding 0.68 models per minute for CosmoFlow and 2.06 models per minute for DeepCAM.
The (U.S.) National Center for Supercomputing Applications did its benchmarks on the Hardware Accelerated Learning (HAL) system. Using 32 CPUs and 64 GPUs, the center trained OpenCatalyst in 1021.18 minutes and DeepCAM in 133.91 minutes.
Nvidia, which made the GPUs used in every entry except RIKEN's, tested its DGX A100 systems on CosmoFlow (8.04 minutes using 256 CPUs and 1024 GPUs) and DeepCAM (1.67 minutes with 512 CPUs and 2048 GPUs). In weak scaling, the system was made up of 1024 CPUs and 4096 GPUs, and it plowed through 0.73 CosmoFlow models per minute and 5.27 DeepCAM models per minute.
Texas Advanced Computing Center's Frontera-Longhorn system tackled CosmoFlow in 140.45 minutes and DeepCAM in 76.9 minutes using 64 CPUs and 128 GPUs.
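Because the GPU-based weak scaling entries used different numbers of accelerators, one rough way to compare them is to normalize throughput by GPU count. The Python sketch below does that with the figures quoted above. The per-1,000-GPU normalization is an assumption made here for illustration, not an MLPerf HPC metric, and it ignores differences in GPU generation, CPUs, and interconnect.

```python
# (system, benchmark, models per minute, GPUs) as quoted in the article.
weak_scaling = [
    ("JUWELS", "DeepCAM", 0.76, 3072),
    ("Perlmutter", "CosmoFlow", 0.68, 5120),
    ("Perlmutter", "DeepCAM", 2.06, 5120),
    ("Nvidia DGX A100", "CosmoFlow", 0.73, 4096),
    ("Nvidia DGX A100", "DeepCAM", 5.27, 4096),
]


def per_thousand_gpus(models_per_minute: float, gpus: int) -> float:
    """Illustrative normalization: throughput per 1,000 GPUs."""
    return models_per_minute / gpus * 1000


normalized = {
    (system, benchmark): per_thousand_gpus(rate, gpus)
    for system, benchmark, rate, gpus in weak_scaling
}

# Group by benchmark, highest normalized throughput first.
for (system, benchmark), value in sorted(
    normalized.items(), key=lambda item: (item[0][1], -item[1])
):
    print(f"{benchmark:10s} {system:16s} {value:.3f} models/min per 1,000 GPUs")
```

Rates for different benchmarks are not comparable with each other, since each network has its own training cost; the comparison is only meaningful within a benchmark.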
Editor's note 1 Dec 2021: This post incorrectly defined exaflop as "one-thousand trillion 64-bit floating-point operations per second." It now correctly defines it as one-million trillion flops per second.