The Case for Running AI on CPUs Isn’t Dead Yet

GPUs may dominate, but CPUs could be perfect for smaller AI models


Advances in LLM compression have drastically improved the models’ performance on x86 processors.


It’s time to give the humble CPU another crack at AI.

That’s the conclusion reached by a small but increasingly vocal group of AI researchers. Julien Simon, the chief evangelist of AI company Hugging Face, recently demonstrated the CPU’s untapped potential with Intel’s Q8-Chat, a large language model (LLM) capable of running on a single Intel Xeon processor with 32 cores. The demo offers a chat interface like OpenAI’s ChatGPT and responds to queries at blazing speeds that (from personal experience) leave ChatGPT in the dust.

GPU usage in AI development is so ubiquitous that it’s hard to imagine another outcome, but it wasn’t inevitable. Several specific events helped GPU hardware outmaneuver both CPUs and, in many cases, dedicated AI accelerators.

“Unlocking the massively parallel architecture of GPUs to train deep neural networks is one of the key factors that made deep learning possible,” says Simon. “GPUs were then quickly integrated in open-source frameworks like TensorFlow and PyTorch, making them easy to use without having to write complex low-level CUDA code.”

Compute Unified Device Architecture (CUDA) is an application programming interface (API) that Nvidia introduced in 2007 as part of its plan to challenge the dominance of CPUs. It was well established by the middle of the 2010s, providing TensorFlow and PyTorch a clear route to tap the power of Nvidia hardware. Hugging Face, as a central hub for the AI community that (among other things) provides an open-source Transformers library compatible with TensorFlow and PyTorch, has played a role in CUDA’s growth, too.

Nvidia’s A100 is a powerful tool for AI, but high demand has made the hardware tough to obtain. Nvidia

Yet Simon believes that “monopolies are never a good thing.” The GPU’s dominance may exacerbate supply-chain issues and lead to higher costs, a possibility underscored by Nvidia’s blowout Q1 2023 financial results, in which earnings rose 28 percent on the back of demand for AI. “It’s near impossible to get an [Nvidia] A100 on AWS or Azure. So, what then?” asks Simon. “For all these reasons, we need an alternative, and Intel CPUs work very well in many inference scenarios, if you care to do your homework and use the appropriate tools.”

The ubiquity of CPUs provides a workaround to the GPU’s dominance. A recent report from PC component market research firm Mercury Research found that 374 million x86 processors were shipped in 2022 alone. Arm-based processors are more common still: Arm reported that its partners had shipped a cumulative total of more than 250 billion chips through the third quarter of 2022.

AI developers have largely ignored this pool of untapped potential, assuming that the CPU’s relative lack of parallelism would be a poor fit for deep learning, which typically relies on numerous matrix multiplications performed in parallel. The rapid increase in AI model size, driven by the success of models like OpenAI’s GPT-3 (175 billion parameters) and DeepMind’s Chinchilla (70 billion parameters), has only reinforced that assumption.

“We are at the point where the fundamental dense matrix multiplications are becoming prohibitive, even with the co-evolved software and hardware ecosystem, for the size of models and datasets,” says Anshumali Shrivastava, the CEO and founder of ThirdAI.


It doesn’t have to be that way. ThirdAI’s research has found that “more than 99 percent” of the operations in existing LLMs produce zeros. ThirdAI deploys a hashing technique to skip these unnecessary operations. “The hashing-based algorithms eliminated the need to waste any cycle and energy on the zeros that don’t matter,” says Shrivastava.
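ThirdAI hasn’t published the internals of its algorithms here, but the general idea behind hashing-based sparse inference can be sketched with SimHash, a form of locality-sensitive hashing: index each neuron’s weight vector into hash buckets up front, then at inference time hash the input and compute dot products only for neurons landing in the same bucket. Every name and parameter below is illustrative, not ThirdAI’s API.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_neurons, n_bits = 64, 1000, 10   # illustrative sizes, not ThirdAI's

W = rng.normal(size=(n_neurons, d))   # one weight vector per neuron
planes = rng.normal(size=(n_bits, d)) # shared random hyperplanes (SimHash)

def bucket(v):
    """SimHash: the sign pattern of v against the hyperplanes, packed into an int."""
    return sum(1 << i for i, s in enumerate(planes @ v > 0) if s)

# One-time indexing: group neurons by the hash of their weight vector.
table = defaultdict(list)
for j in range(n_neurons):
    table[bucket(W[j])].append(j)

# Inference: only evaluate neurons whose weights hash like the input —
# inputs and weights that collide tend to be well aligned, so the skipped
# neurons are the ones likely to contribute near-zero activations.
x = rng.normal(size=d)
candidates = table[bucket(x)]
out = {j: W[j] @ x for j in candidates}
print(len(candidates), "of", n_neurons, "neurons evaluated")
```

A real system would probe several hash tables (or nearby buckets) to avoid missing important neurons, trading a little accuracy for a large reduction in multiply-accumulate work.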

His company recently demonstrated the potential of its technique with PocketLLM, an AI-assisted document-management app for Windows and Mac that can comfortably run on CPUs found in most modern laptops. ThirdAI also offers Bolt Engine, a Python API for training deep-learning models on consumer-grade CPUs.

PocketLLM’s neural search handles training and inference on consumer-grade CPUs. ThirdAI

Intel’s Q8-Chat, demonstrated by Hugging Face’s Simon, takes a different tack, achieving its results through a model-compression technique called quantization, which replaces 16-bit floating-point parameters with 8-bit integers. These are less precise but faster to execute and require less memory. Intel used a specific quantization technique, SmoothQuant, to reduce the size of several common LLMs, such as Meta’s LLaMA and OPT, by half. The public Q8-Chat demonstration is based on MPT-7B, an open-source LLM from MosaicML with 7 billion parameters.
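The basic trade quantization makes can be seen in a few lines of NumPy. The sketch below shows plain symmetric 8-bit quantization — not SmoothQuant itself, which additionally smooths activation outliers into the weights before quantizing — but the core idea is the same: store each weight as an int8 plus a shared scale factor, halving memory relative to 16-bit floats at the cost of a small, bounded rounding error.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float weights -> int8 plus one scale."""
    scale = np.abs(weights).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)     # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is half the size of float16, and the rounding error per
# weight is at most half a quantization step (scale / 2).
print(q.dtype, float(np.abs(w - w_hat).max()) <= scale / 2)
```

In a real int8 inference path the matrix multiplications themselves run on integer units (which CPUs execute cheaply), with the scale applied once at the end rather than dequantizing weights up front.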

Intel continues to develop AI optimizations for its Sapphire Rapids processors, which power the Q8-Chat demo. The company’s recent MLPerf 3.0 submission for Sapphire Rapids showed more than a fivefold improvement in offline inference performance over the prior generation, Ice Lake, and a tenfold improvement in server scenarios. Intel also showed up to a 40 percent gain over its own previous Sapphire Rapids submission, an uplift achieved through software and “workload-specific optimizations.”

This isn’t to say CPUs will now supplant GPUs in all AI tasks. Simon believes that “in general, smaller LLMs are always preferable,” but admits “there is no Swiss Army knife model that works well across all use cases and all industries.” Still, the stage looks set for an increase in CPU relevance. Shrivastava is particularly bullish on this potential turn of fortune, seeing a need for small “domain specialized LLMs” tuned to tackle specific tasks. Both Simon and Shrivastava say these smaller LLMs are not just efficient but also provide benefits in privacy, trust, and safety, as they eliminate the need to rely on a large general model controlled by a third party.

“We are building the capabilities to bring every core of CPUs out there to better the AI for the masses,” says Shrivastava. “We can democratize AI with CPUs.”
