The machine learning field is moving fast, and the yardsticks used to measure its progress are racing to keep up. A case in point: MLPerf, the twice-yearly machine learning competition sometimes termed “the Olympics of AI,” has introduced three new benchmark tests that reflect new directions in the field.
“Lately, it has been very difficult trying to follow what happens in the field,” says Miro Hodak, an Advanced Micro Devices engineer and MLPerf Inference working-group cochair. “We see that the models are becoming progressively larger, and in the last two rounds we have introduced the largest models we’ve ever had.”
The chips that tackled these new benchmarks came from the usual suspects: Nvidia, AMD, and Intel. Nvidia topped the charts, introducing its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design. AMD put up a strong performance, introducing its latest MI355X GPUs. Intel proved that one can still do inference on CPUs with its Xeon submissions, but it also entered the GPU game with an Intel Arc Pro submission.
New benchmarks
Last round, MLPerf introduced its largest benchmark yet, a large language model based on Llama 3.1-405B. This round, MLPerf topped itself again, introducing a benchmark based on the DeepSeek-R1 671B model, which has more than 1.5 times as many parameters as the previous largest benchmark.
As a reasoning model, DeepSeek-R1 works through several steps of chain-of-thought reasoning when approaching a query. That means far more of the computation happens during inference than in normal LLM operation, making this benchmark even more challenging. Reasoning models are claimed to be the most accurate, making them the technique of choice for science, math, and complex programming queries.
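For a rough sense of why that matters, here is a back-of-envelope sketch. It assumes the common rule of thumb that decoding one token costs about 2 floating-point operations per active parameter; the 3,000 chain-of-thought tokens are a hypothetical figure, not an MLPerf measurement:

```python
# Back-of-envelope: why reasoning models cost more at inference time.
# Rule of thumb (assumption): decoding one token takes ~2 FLOPs per
# active parameter. DeepSeek-R1 is a mixture-of-experts model that
# activates roughly 37 billion of its 671 billion parameters per token.

ACTIVE_PARAMS = 37e9

def decode_flops(num_tokens: int) -> float:
    """Approximate FLOPs to generate num_tokens of output."""
    return 2 * ACTIVE_PARAMS * num_tokens

direct = decode_flops(200)            # a 200-token direct answer
reasoned = decode_flops(3_000 + 200)  # hypothetical 3,000 "thinking" tokens first

print(f"direct answer:  {direct:.1e} FLOPs")
print(f"with reasoning: {reasoned:.1e} FLOPs ({reasoned / direct:.0f}x)")
```

Even under these toy numbers, the hidden reasoning tokens multiply the work per query many times over.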
In addition to the largest LLM benchmark yet, MLPerf also introduced the smallest, based on Llama 3.1-8B. There is growing industry demand for low-latency yet high-accuracy reasoning, explained Taran Iyengar, the MLPerf Inference task-force chair. Small LLMs can supply this, and they’re an excellent choice for tasks such as text summarization and edge applications.
This brings the total count of LLM-based benchmarks to a confusing four. They include the new, smallest Llama 3.1-8B benchmark; the preexisting Llama 2-70B benchmark; the Llama 3.1-405B benchmark introduced last round; and the largest, the new DeepSeek-R1 benchmark. If nothing else, this signals that LLMs are not going anywhere.
In addition to the myriad LLMs, this round of MLPerf Inference included a new voice-to-text model, based on Whisper-large-v3. This benchmark is a response to the growing number of voice-enabled applications, whether they’re smart devices or speech-based AI interfaces.
The MLPerf Inference competition has two broad categories: “closed,” which requires using the reference neural-network model as-is without modifications, and “open,” where some modifications to the model are allowed. Within those, there are several subcategories related to how the tests are done and in what sort of infrastructure. We will focus on the “closed” data-center server results for the sake of sanity.
Nvidia leads
Surprising no one, the best performance per accelerator on each benchmark, at least in the server category, was achieved by an Nvidia GPU-based system. Nvidia also unveiled the Blackwell Ultra, topping the charts in the two largest benchmarks: Llama 3.1-405B and DeepSeek-R1 reasoning.
Blackwell Ultra is a more-powerful iteration of the Blackwell architecture, featuring significantly more memory capacity, double the acceleration for attention layers, 1.5 times more AI compute, and faster memory and connectivity compared with the standard Blackwell. It is intended for larger AI workloads, like the two benchmarks it was tested on.
In addition to the hardware improvements, Dave Salvator, director of accelerated computing products at Nvidia, attributes the success of Blackwell Ultra to two key changes. First, the use of Nvidia’s proprietary 4-bit floating-point number format, NVFP4. “We can deliver comparable accuracy to formats like BF16,” Salvator says, while using a lot less computing power.
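NVFP4 itself is proprietary, but the general recipe behind block-scaled 4-bit floats can be sketched. In the toy example below, the E2M1 value grid and the 16-weight block size are generic assumptions about how such formats typically work, not Nvidia’s actual specification:

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x: np.ndarray) -> np.ndarray:
    """Quantize one block of weights to a shared-scale 4-bit float grid."""
    # Map the block's largest magnitude onto the grid's top value (6.0)
    scale = max(float(np.abs(x).max()) / FP4_GRID[-1], 1e-12)
    mags = np.abs(x) / scale
    # Snap each magnitude to the nearest representable FP4 value
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

block = np.random.randn(16).astype(np.float32)  # one 16-weight block (size is an assumption)
quantized = quantize_block(block)
print("max abs error:", float(np.abs(block - quantized).max()))
```

Storing 4 bits per weight plus one scale per block is what lets such formats cut memory and compute so sharply while keeping accuracy close to 16-bit baselines.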
The second is so-called disaggregated serving. The idea behind disaggregated serving is that there are two main parts to the inference workload: prefill, where the query (“Please summarize this report”) and its entire context window (the report) are loaded into the LLM, and generation/decoding, where the output is actually calculated. These two stages have different requirements. While prefill is compute heavy, generation/decoding is much more dependent on memory bandwidth. Salvator says that by assigning different groups of GPUs to the two different stages, Nvidia achieves a performance gain of nearly 50 percent.
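The division of labor can be sketched in a few lines of Python. The pool sizes, queue hand-off, and function names below are illustrative assumptions, not Nvidia’s actual scheduler:

```python
# Toy sketch of disaggregated serving: one GPU pool handles the
# compute-bound prefill stage, another handles the memory-bandwidth-bound
# decode stage. Pool sizes and hand-off logic are illustrative only.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str                      # the query plus its full context window
    kv_cache: str | None = None      # built by prefill, read by decode
    output: list[str] = field(default_factory=list)

PREFILL_POOL = ["gpu0", "gpu1"]                 # compute-heavy stage
DECODE_POOL = ["gpu2", "gpu3", "gpu4", "gpu5"]  # bandwidth-heavy stage

decode_queue: "Queue[Request]" = Queue()

def prefill(req: Request) -> None:
    # One big parallel pass over the whole prompt builds the KV cache.
    req.kv_cache = f"kv({len(req.prompt)} chars)"  # stand-in for the real cache
    decode_queue.put(req)                          # hand off to the decode pool

def decode(req: Request, max_tokens: int = 3) -> None:
    # Token-by-token generation, dominated by reading the KV cache.
    for _ in range(max_tokens):
        req.output.append("<tok>")

r = Request(prompt="Please summarize this report ...")
prefill(r)                  # runs on the prefill pool
decode(decode_queue.get())  # runs on the decode pool
print(r.kv_cache, r.output)
```

Because each pool can be sized and tuned for its own bottleneck, neither stage sits idle waiting on the other, which is where the claimed near-50 percent gain comes from.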
AMD close behind
AMD’s newest accelerator chip, the MI355X, launched in July. The company submitted it only in the “open” category, where software modifications to the model are permitted. Like Blackwell Ultra, the MI355X features 4-bit floating-point support, as well as expanded high-bandwidth memory. The MI355X beat its predecessor, the MI325X, on the open Llama 2-70B benchmark by a factor of 2.7, says Mahesh Balasubramanian, senior director of data-center GPU product marketing at AMD.
AMD’s “closed” submissions included systems powered by MI300X and MI325X GPUs. Systems built with the more advanced MI325X performed on par with Nvidia H200-based computers on the Llama 2-70B, mixture-of-experts, and image-generation benchmarks.
This round also included the first hybrid submission, in which AMD MI300X and MI325X GPUs worked together on the same inference task, the Llama 2-70B benchmark. Hybrid submissions matter because new GPUs arrive at a yearly cadence, and the older models, already deployed en masse, are not going anywhere. Being able to spread workloads among different kinds of GPUs is an essential step.
Intel enters the GPU game
Intel has long maintained that one does not need a GPU to do machine learning. Indeed, submissions using Intel’s Xeon CPU still performed on par with the Nvidia L4 on the object-detection benchmark, though they trailed on the recommender-system benchmark.
In this round, for the first time, an Intel GPU also made a showing. The Intel Arc Pro was first released in 2022. The MLPerf submission featured a graphics card called the MaxSun Intel Arc Pro B60 Dual 48G Turbo, which contains two GPUs and 48 gigabytes of memory. The system performed on par with Nvidia’s L40S on the small LLM benchmark and trailed it on the Llama 2-70B benchmark.
Dina Genkina is an associate editor at IEEE Spectrum focused on computing and hardware. She holds a PhD in atomic physics and lives in Brooklyn.