The Case for Running AI on CPUs Isn’t Dead Yet

GPUs may dominate, but CPUs could be perfect for smaller AI models

IEEE Spectrum: For the Technology Insider
© Copyright 2024 IEEE — All rights reserved. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

It’s time to give the humble CPU another crack at AI.

That’s the conclusion reached by a small but increasingly vocal group of AI researchers. Julien Simon, the chief evangelist of AI company Hugging Face, recently demonstrated the CPU’s untapped potential with Intel’s Q8-Chat, a large language model (LLM) capable of running on a single Intel Xeon processor with 32 cores. The demo offers a chat interface like OpenAI’s ChatGPT and responds to queries at blazing speeds that (from personal experience) leave ChatGPT eating dust.

GPU usage in AI development is so ubiquitous that it’s hard to imagine another outcome, but it wasn’t inevitable. Several specific events helped GPU hardware outmaneuver both CPUs and, in many cases, dedicated AI accelerators.

“Unlocking the massively parallel architecture of GPUs to train deep neural networks is one of the key factors that made deep learning possible,” says Simon. “GPUs were then quickly integrated in open-source frameworks like TensorFlow and PyTorch, making them easy to use without having to write complex low-level CUDA code.”

Compute Unified Device Architecture (CUDA) is an application programming interface (API) that Nvidia introduced in 2007 as part of its plan to challenge the dominance of CPUs. It was well established by the middle of the 2010s, providing TensorFlow and PyTorch a clear route to tap the power of Nvidia hardware. Hugging Face, as a central hub for the AI community that (among other things) provides an open-source Transformers library compatible with TensorFlow and PyTorch, has played a role in CUDA’s growth, too.

Nvidia’s A100 is a powerful tool for AI, but high demand has made the hardware tough to obtain. (Image: Nvidia)

Yet Simon believes that “monopolies are never a good thing.” The GPU’s dominance may exacerbate supply-chain issues and lead to higher costs, a possibility underscored by Nvidia’s blowout Q1 2023 financial results, in which earnings rose 28 percent on the back of demand for AI. “It’s near impossible to get an [Nvidia] A100 on AWS or Azure. So, what then?” asks Simon. “For all these reasons, we need an alternative, and Intel CPUs work very well in many inference scenarios, if you care to do your homework and use the appropriate tools.”

The ubiquity of CPUs offers a way around the GPU’s dominance. A recent report from PC component market research firm Mercury Research found that 374 million x86 processors were shipped in 2022 alone. ARM processors are even more common, with over 250 billion chips shipped cumulatively through the third quarter of 2022.

AI developers have largely ignored this pool of untapped potential, assuming that the CPU’s relative lack of parallel processing makes it a poor fit for deep learning, which typically relies on numerous matrix multiplications performed in parallel. The rapid increase in AI model size, pushed by the success of models like OpenAI’s GPT-3 (175 billion parameters) and DeepMind’s Chinchilla (70 billion parameters), has worsened the problem.
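To make the parallelism point concrete, here is a minimal sketch (not taken from any framework mentioned in this article) of a dense neural-network layer written as a matrix-vector product. The function names `dot`, `dense_layer`, and `mac_count` are hypothetical. Each output value is an independent dot product, which is why the work spreads so naturally across many GPU cores. The thread pool only illustrates that independence; in standard Python it will not produce a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, x):
    """One output value: the dot product of a weight row and the input."""
    return sum(w * v for w, v in zip(row, x))


def dense_layer(weights, x, workers=4):
    """Compute y = W x, handing each row's dot product to a worker.

    No row depends on any other row, so all of them could run at once.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: dot(row, x), weights))


def mac_count(rows, cols):
    """Multiply-accumulate operations needed for one matrix-vector product."""
    return rows * cols


W = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]
x = [2.0, 1.0, 1.0]

y = dense_layer(W, x)
print(y)                # [4.0, 2.0]
print(mac_count(2, 3))  # 6
```

The operation count grows with the product of the layer's dimensions, so larger models multiply the amount of work that has to be done in parallel.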

“We are at the point where the fundamental dense matrix multiplications are becoming prohibitive, even with the co-evolved software and hardware ecosystem, for the size of models and datasets,” says Anshumali Shrivastava, the CEO and founder of ThirdAI.

It doesn’t have to be that way. ThirdAI’s research has found that “more than 99 percent” of operations in existing LLMs return a zero. ThirdAI deploys a hashing technique to trim these unnecessary operations. “The hashing-based algorithms eliminated the need to waste any cycle and energy on the zeros that don’t matter,” says Shrivastava.
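The article does not describe ThirdAI’s algorithm in detail, so the following is only a rough sketch of the general idea, assuming a sign-random-projection hash (a common form of locality-sensitive hashing). All function names here are hypothetical. Neurons are bucketed by the hash of their weight vectors, and at inference time only the neurons in the input’s bucket are evaluated; every other neuron is skipped on the assumption that its output would not matter.

```python
import random


def make_planes(num_bits, dim, seed=0):
    """Random hyperplanes; each one contributes a single bit to the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """Hash a vector to a bit tuple: which side of each hyperplane it lies on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_table(weights, planes):
    """Group neuron indices by the hash of their weight vectors."""
    table = {}
    for idx, w in enumerate(weights):
        table.setdefault(signature(w, planes), []).append(idx)
    return table


def sparse_forward(weights, table, planes, x):
    """Evaluate only the neurons that share a bucket with the input."""
    active = table.get(signature(x, planes), [])
    return {i: sum(w * v for w, v in zip(weights[i], x)) for i in active}


weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, -1.0, 2.0],   # points the same way as the input below
    [-0.5, 1.0, -2.0],  # points the opposite way
]
planes = make_planes(num_bits=4, dim=3)
table = build_table(weights, planes)

x = [0.5, -1.0, 2.0]
outputs = sparse_forward(weights, table, planes, x)
print(outputs)  # always includes neuron 2, never neuron 3
```

Vectors that point in similar directions tend to land in the same bucket, so the neurons most likely to respond strongly are the ones that get computed. A production system would use several hash tables and tune the number of bits; this sketch uses one table to keep the idea visible.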

His company recently demonstrated the potential of its technique with PocketLLM, an AI-assisted document-management app for Windows and Mac that can comfortably run on CPUs found in most modern laptops. ThirdAI also offers Bolt Engine, a Python API for training deep-learning models on consumer-grade CPUs.

PocketLLM’s neural search handles training and inference on consumer-grade CPUs. (Image: ThirdAI)

Hugging Face’s Q8-Chat takes a different tack, achieving its results through a model compression technique called quantization, which replaces 16-bit floating-point parameters with 8-bit integers. These are less precise but easier to execute and require less memory. Intel used a specific quantization technique, SmoothQuant, to reduce the size of several common LLMs, such as Meta’s LLaMA and OPT, by half. The public Q8-Chat demonstration is based on MPT-7B, an open-source LLM from MosaicML with 7 billion parameters.
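As a rough illustration of what quantization does, the sketch below uses plain symmetric “absmax” quantization. This is an assumption chosen for simplicity: it is not SmoothQuant, which adds further steps to deal with hard-to-quantize activations. The function names are hypothetical, and the size estimate counts weights only.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


def model_size_gb(num_params, bytes_per_param):
    """Approximate storage for the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # [50, -127, 0, 100]
print(restored)  # close to the original weights, but not exact

# A 7-billion-parameter model: 2 bytes per weight at 16 bits, 1 byte at 8 bits.
print(model_size_gb(7_000_000_000, 2))  # 14.0
print(model_size_gb(7_000_000_000, 1))  # 7.0
```

The rounding error for each weight is at most half the scale factor, which is the precision given up in exchange for halving the memory footprint.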

Intel continues to develop AI optimizations for its Sapphire Rapids processors, which are used in the Q8-Chat demo. The company’s recent MLPerf 3.0 submission for Sapphire Rapids showed inference performance in offline scenarios more than five times that of the prior generation, Ice Lake. In server scenarios, performance was 10 times Ice Lake’s. Intel also showed an up to 40 percent improvement over its prior submission for Sapphire Rapids, an uplift achieved through software and “workload-specific optimizations.”

This isn’t to say CPUs will now supplant GPUs in all AI tasks. Simon believes that “in general, smaller LLMs are always preferable,” but admits “there is no Swiss Army knife model that works well across all use cases and all industries.” Still, the stage looks set for an increase in CPU relevance. Shrivastava is particularly bullish on this potential turn of fortune, seeing a need for small “domain specialized LLMs” tuned to tackle specific tasks. Both Simon and Shrivastava say these smaller LLMs are not just efficient but also provide benefits in privacy, trust, and safety, as they eliminate the need to rely on a large general model controlled by a third party.

“We are building the capabilities to bring every core of CPUs out there to better the AI for the masses,” says Shrivastava. “We can democratize AI with CPUs.”
