IBM’s New Telum Chip Reboots the Mainframe

Big Blue’s z16 computer—and the cache-savvy design at its core—gives new relevance to the platform

IBM’s Telum processor, shown here in its wafer state, contains eight cores clocked at over 5 gigahertz. Crucially, each core has its own 32-megabyte level 2 cache. Delivering on the promise of this innovation at a system-wide scale was one of the key challenges for IBM’s new z16 mainframe. Image: IBM

IBM recently launched its new line of mainframe computers, the IBM z16, though the event was not exactly front-page news, a few bursts of media celebration aside. Yet the mainframe’s diminished prominence in today’s tech landscape does not mean the platform is dwindling away. Consider IBM’s touting of the z16’s real-time, AI-driven screening of transactions for fraud. (IBM also dubs the z16 the “industry’s first quantum-safe system.”)

These aggressive claims could arguably put the z16 on a path to reviving the mainframe platform as a whole. And at the core of all these capabilities is its silicon. The z16’s foundation, in fact, is IBM’s Telum chip, which itself was launched just last summer. The architecture of this chip makes possible the AI-enabled mainframes that IBM is pushing toward today. And at the heart of Telum’s allure is its novel approach to cache design.

Caches are a key design component of every microprocessor and have a huge impact on its overall performance. They are like storage lockers that let data be stored or, well, cached ahead of time at the processor’s front doorstep.

This is particularly critical for today’s processors, which run so fast (often at multiple gigahertz) that whenever the core has to wait for data to arrive, precious clock cycles are simply wasted.

The generation of chips prior to Telum employed four levels of cache: The level closest to the core is level one (L1), the next level up is level two (L2), and so on up to level four (L4). Traditionally, cache hierarchies have been built with L1, L2, and L3 on the chip, while L4 sits off the chip. Each level is larger than the one before it, but latency also grows the farther a level sits from the processor engine itself.
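
To make that tradeoff concrete, here is a minimal sketch, in Python, of how a lookup walks such a hierarchy. The level sizes, latencies, and function names are illustrative placeholders, not figures from IBM’s chips.

    # Illustrative sketch of a traditional four-level cache lookup.
    # Sizes and latencies are placeholders, not IBM's real figures;
    # each level is bigger, but slower, than the one before it.
    HIERARCHY = [
        ("L1", 128 * 1024,         1),   # on chip, closest to the core
        ("L2", 4 * 1024 * 1024,    4),   # on chip
        ("L3", 64 * 1024 * 1024,  15),   # on chip, shared by the cores
        ("L4", 960 * 1024 * 1024, 60),   # off chip
    ]

    def lookup(address, contents):
        """Walk the hierarchy; total latency is the sum of every level checked."""
        latency = 0
        for name, _size, cycles in HIERARCHY:
            latency += cycles
            if address in contents[name]:
                return name, latency
        return "memory", latency + 200   # placeholder main-memory penalty

    contents = {"L1": set(), "L2": set(), "L3": {0x1000}, "L4": set()}
    print(lookup(0x1000, contents))      # ('L3', 20) with these placeholder numbers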

With the new Telum chip, IBM has eliminated both the physical L3 and L4 caches. How they managed to do that is an engineering feat that took five years to realize.

“What we’ve done with the Telum chip is we’ve completely re-architected how these caches work to keep a lot more data a lot closer to the processor core than we have done in the past,” said Christian Jacobi, IBM Fellow and chief technology officer of system architecture and design for IBM zSystems. “To do this, we’ve quadrupled the level two cache. We now have a 32-MB level-two cache.”

This size for an L2 cache stands in stark contrast to most other server chips, whose L2 caches are on the order of half a megabyte to 1 MB. For this much larger L2 cache to work effectively, IBM optimized the access patterns and the way the processor core reaches into that very large 32 MB of cache (256 MB across the eight cores), so that latencies stay extremely low.

“We’re no longer limited by the control logic figuring out where the data actually is in the cache and then getting the request to the right sector of the cache and slowly move the data over,” said Jacobi. “We’re designing it such that we are truly limited only by the electrical transmission delays to get to the data, to trigger the read of the data out of the array, and then flow it back to where it’s needed.”

In terms of actual numbers, the Telum chip can deliver data with a best-case latency of under 3 nanoseconds and an average latency of 3.6 nanoseconds. At core clocks above 5 gigahertz, that average corresponds to waiting through roughly 18 clock cycles or more. “We’ve optimized this access pipeline and through that, we have created a huge performance benefit for us,” added Jacobi.

Jacobi and his team saw the benefit of eliminating the physical L3 and L4 caches by enlarging the L2, but they still wanted the extra storage capacity that those outer caches had provided. To preserve that capacity, they decided to reshape and redefine how the caches they do have interact.

So they began with the realization that, in a chip with eight cores that each have their own cache, not every core is equally busy all the time. From one instant to the next, each core’s workload makes heavier or lighter use of its private cache.

“That operation becomes a huge opportunity for us,” explained Jacobi. “If one core is very busy on its L2 and needs actually more than 32 megabytes, and another core on the chip uses less cache, I can use the other L2 on the chip as a spillover place for the very busy caches.”

This is how Jacobi and his team arrived at the concept of a virtual L3 cache: underused L2 capacity serves as a spillover place for other, overbooked L2s. When data needs to be accessed, it can still be retrieved from 32 MB of cache sitting very, very close to the core. The spillover space on the chip thus effectively amounts to a maximum of 32 × 8 = 256 MB. If the other cores are not doing anything, all of their cache is theoretically available as spillover for a single core, which can then still reach that data with a very low latency of only 12 nanoseconds.
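
One rough way to picture that spillover, sketched below in Python with made-up capacities and function names (an illustration of the idea, not IBM’s implementation): when a busy core’s L2 must evict a line, the line is parked in the least-occupied neighboring L2, so a later miss can still find it somewhere on the chip.

    # Sketch of the virtual-L3 idea: a busy core's L2 evictions are parked
    # in the least-occupied neighboring L2 instead of leaving the chip.
    # Capacities are counted in "lines" purely to keep the example tiny.
    L2_CAPACITY = 4
    l2 = {core: {} for core in range(8)}         # per-core L2: address -> data

    def insert(core, addr, data):
        if len(l2[core]) >= L2_CAPACITY:
            spill(core)                          # make room by spilling a victim line
        l2[core][addr] = data

    def spill(busy_core):
        # Pick the oldest-inserted line as a simple victim choice.
        victim_addr, victim_data = next(iter(l2[busy_core].items()))
        del l2[busy_core][victim_addr]
        # Use the least-occupied other L2 as discretionary spillover space.
        target = min((c for c in l2 if c != busy_core), key=lambda c: len(l2[c]))
        if len(l2[target]) < L2_CAPACITY:
            l2[target][victim_addr] = victim_data    # still on chip: the "virtual L3"

    def find_on_chip(addr):
        """A miss in the local L2 can still hit in a neighbor's L2."""
        return next((c for c in l2 if addr in l2[c]), None)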

“We're trying to get every core what they need, but then what they don’t need becomes available as discretionary space for the cores that could benefit from it,” said Jacobi. “Because we still optimized to have very low latency, effectively this ends up like an L3, but it’s composed out of the physical L2 caches instead of being its own piece of silicon area on the chip.”

While some in the tech press have described this operation as “magic,” it’s really just smart engineering. Jacobi explained that the chip constantly measures how busy each cache is. With the heuristics IBM has built into Telum, the chip can determine that over the last microsecond one cache was very active while another was less so. Armed with that information, it then redirects traffic and uses the quieter neighboring cache as a spillover, all the while continuing to measure how busy that cache-shuffling makes the other caches.
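
A minimal sketch of what that kind of bookkeeping could look like, again in Python, with the core count, counters, and window length as assumed placeholders rather than details of Telum’s actual logic:

    # Illustrative sketch of such a heuristic (not IBM's actual logic):
    # tally how busy each L2 has been in the current measurement window,
    # then steer spillover traffic toward the quietest cache.
    from collections import Counter

    activity = Counter()            # accesses per core in the current window

    def record_access(core):
        activity[core] += 1         # the chip "measures itself" continuously

    def pick_spillover_target(busy_core, cores=range(8)):
        """Choose the neighbor that was least active in the current window."""
        return min((c for c in cores if c != busy_core), key=lambda c: activity[c])

    def end_of_window():
        activity.clear()            # reset for the next interval, on the order of a microsecond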

Jacobi noted that this architecture also provides a clean way for special-purpose hardware accelerators integrated into the mainframe, such as the AI accelerators that enable real-time fraud detection, to access the data in the CPU caches.
